<p><a name="sections"></a></p>


# Sections

- <a href="#for">For Loop</a><br>
 - <a href="#copy">For Loop to Copy</a><br>
- <a href="#while">While Loop</a><br>
- <a href="#error">Errors and Exceptions</a><br>
 - <a href="#built">Built-in Exceptions</a><br>
 - <a href="#handle">Handling Exceptions</a><br>
- <a href="#class">Classes</a><br>
 - <a href="#attri">Attributes and Methods</a><br>
 - <a href="#special">Special Name Method</a><br>
 - <a href="#inherence">Inheritance</a><br>
- <a href="#pandas">Intro to Pandas</a><br>
 - <a href="#IO">I/O of Data Frame</a><br>
 - <a href="#index">Index and Column Names</a><br>
 - <a href="#select">Selection and Filtering</a><br>
   - <a href="#single">Select a Single Value</a><br>
   - <a href="#singleLine">Selecting a Single Row or Column</a><br>
   - <a href="#multipleLine">Selecting Multiple Rows or Columns</a><br>
   - <a href="#fancy">Fancy Indexing</a><br>
 - <a href="#sort">Sorting</a><br>
 - <a href="#manipulate">Data Manipulation</a><br>


<p><a name="for"></a></p>
# For Loop

A simple example is printing the elements of a list. Since print is a statement, we can’t use it in a map.

In [1]:
words = ['a', 'b', 'c', 'd', 'e']
for w in words:
    print w,     # comma suppresses the newline

a b c d e


Recall that the range function generates a list of numbers:

In [2]:
for i in range(len(words)):
    print i, words[i]

0 a
1 b
2 c
3 d
4 e
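
The index-based loop above can also be written with the built-in `enumerate`, which yields (index, element) pairs. A minimal sketch, collecting the pairs in a list (the name `numbered` is only for illustration) so that it runs unchanged in Python 2 and Python 3:

```python
words = ['a', 'b', 'c', 'd', 'e']

# enumerate yields (index, element) pairs, so no explicit indexing is needed
numbered = []
for i, w in enumerate(words):
    numbered.append((i, w))

numbered   # [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')]
```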


In addition to the iteration variable taking on values in a list, you may want other variables to take on different values in each iteration.  You can accomplish this by “self-assigning” to those variables.  This loop sums the elements of a list:

In [3]:
primes = [2, 3, 5, 7, 11]
sum_ = 0
for p in primes:
    sum_ = sum_ + p
sum_

28

As another example, here is a different way to print a list with its elements numbered:


In [4]:
words = ['a', 'b', 'c', 'd', 'e']
i = 0
for w in words:
    print i, w
    i = i + 1

0 a
1 b
2 c
3 d
4 e


**Exercise 1**

- Print the list of prime numbers along with the running sums of those numbers:
```
primes = [2, 3, 5, 7, 11]
```
The result should be
```
2 2
3 5
5 10
7 17
11 28
```

- Print a list of strings with numbers determined by the lengths of the strings:
```
names = ['don', 'mike', 'vivian', 'saul']
```
The result should be
```
0 don
3 mike
7 vivian
13 saul
```

In [5]:
#### Your code here

2 2
3 5
5 10
7 17
11 28


0 don
3 mike
7 vivian
13 saul


<p><a name="copy"></a></p>
## For Loop to Copy

In [6]:
names = ['don', 'mike', 'vivian', 'saul']
copy = []
for name in names:
    copy.append(name)
copy

['don', 'mike', 'vivian', 'saul']

You should prefer `map` if you have a choice, because it is more concise and more efficient. But some things are hard to do with `map`.  For example, computing running sums is hard, so this loop would be hard to write with `map`:

In [7]:
primes = [2, 3, 5, 7, 11]
prime_sums = []
sum_ = 0
for p in primes:
    sum_ = sum_ + p
    prime_sums.append(sum_)
prime_sums

[2, 5, 10, 17, 28]

**Exercise 2**

- Write a function `map_uc(l)` that takes a list of strings and returns a list of those same strings in all upper-case.  You know how to do that using map; do it this time using a for loop.  You’ll need to create a copy, as in the loop above, and return that.

- In the previous exercise, you wrote a loop that produced this output:
```
0 don
3 mike
7 vivian
13 saul
```
For this exercise, modify that loop to put pairs of these values in a list, instead of printing them, producing: 
```
[[0, 'don'], [3, 'mike'], [7, 'vivian'], [13, 'saul']]
```

In [8]:
#### Your code here

['DON', 'MIKE', 'VIVIAN', 'SAUL']
[[0, 'don'], [3, 'mike'], [7, 'vivian'], [13, 'saul']]


<p><a name="while"></a></p>
# While Loop

- While loops are used when you do not know ahead of time how many iterations you will need:
 - Sum the elements of a list up to the first zero.
 - Newton’s method is used to find a zero of an equation.  It works by finding values that are closer and closer to the zero, until it finds a value “close enough.”  But it is not easy to know how many times it takes to get close enough.
 - Get input from a user until the user enters ‘quit’.

- With a while loop, you iterate until a given condition becomes false:
```
while condition:
   statements
```

As a first example, this loop prints integers from 0 to 9:

In [9]:
i = 0
while i < 10:
    print i
    i = i + 1

0
1
2
3
4
5
6
7
8
9


This for loop does the same thing:

In [10]:
for i in range(0, 10):
    print i

0
1
2
3
4
5
6
7
8
9


One thing we can do with while loops that is hard to do with for loops is to terminate early.  This loop adds up integers starting from 1 until the sum exceeds n:

In [11]:
n = 20
i = 1
sum_ = 0
while sum_ <= n:
    sum_ = sum_ + i
    i = i + 1
sum_

21

This loop is similar, but sums the numbers in a list. Let's create a list `L`:

In [12]:
L = [5, 10, 15, 20, 25]

n = 20
i = 0
sum_ = 0
while sum_ <= n:
    sum_ = sum_ + L[i]
    i = i + 1
    
sum_

30

When we iterate over a list like this, we should also test if we are going out of bounds. Without doing so, we might end up with:

In [13]:
n = 80
i = 1
sum_ = 0
while sum_ <= n:
    sum_ = sum_ + L[i]
    i = i + 1
    
sum_

IndexError: list index out of range

This problem can be fixed by modifying the header:

In [14]:
n = 80
i = 1
sum_ = 0
while sum_ <= n and i < len(L):
    sum_ = sum_ + L[i]
    i = i + 1
    
sum_

70
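
The same two-part header handles the first use case listed at the start of this section, summing the elements of a list up to the first zero. A minimal sketch; the list `values` is made up for illustration:

```python
values = [4, 7, 1, 0, 9, 3]   # hypothetical data

i = 0
total = 0
# Check the index first so that values[i] is never evaluated out of bounds
while i < len(values) and values[i] != 0:
    total = total + values[i]
    i = i + 1

total   # 12: the loop stops at the zero, so 9 and 3 are not added
```

Because `and` short-circuits, putting the bounds test first is what keeps the loop safe for a list that contains no zero.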

**Exercise 3**

- Now we’ll do similar loops, but terminate under different conditions.  If we’re iterating over a list, remember to check that the list index is not out of bounds.

 - Print the elements of a numeric list, up to the first even number.
 - Print the elements of a list of strings, up to the first string whose length exceeds 10.
 - Sum the even elements of a numeric list.  This loop is different in that it contains an if statement (without an else).

In [None]:
#### Your code here

**Break and Continue Statements**

The **break** statement immediately terminates the (for or while) loop it is in.  This provides a way to terminate the loop from within the middle of the body. 

The **continue** statement terminates the current iteration of the loop and goes back to the header.

- The loop below adds the values in a list, but ignores negative numbers, and stops once the sum exceeds 100:


In [16]:
L = [10, -10, 20, -20, 30, -30, 40, -40, 50, -50, 60, -60]

sum_ = 0
for x in L:
    if x < 0:
        continue
    sum_ = sum_ + x
    if sum_ > 100:
        break
        
sum_

150

<p><a name="error"></a></p>
# Errors and Exceptions

- Exceptions are a language mechanism in Python (and many other languages) for handling unexpected and undesirable situations.  Typical examples are:
 - Opening a file that does not exist
 - Dividing by zero

- The exception mechanism allows a program to handle such situations gracefully, without creating a lot of extra code.

The mechanism has two parts: signaling an exception and catching it.
- Signal an exception:
```
raise Exception
```
- Catch exception: `try`
```
try:
        commands
except Exception:
        handle exception
```
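
As a minimal sketch of both halves together, the hypothetical function `withdraw` below signals an error with `raise`, and the caller catches it with `try`/`except`. The function name, the amounts, and the message are illustrative assumptions, not part of any library:

```python
def withdraw(balance, amount):
    # Signal the exception: refuse to overdraw the account
    if amount > balance:
        raise Exception('insufficient funds')
    return balance - amount

results = []
for amount in [30, 500]:
    try:
        results.append(withdraw(100, amount))
    except Exception as e:
        # Catch the exception: record its message instead of crashing
        results.append(e.args[0])

results   # [70, 'insufficient funds']
```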

Many predefined functions, or functions you import from modules, can throw exceptions.  For example, the function `open` below raises an error when the file indicated by `filename` does not exist.

We focus on handling the error first:

In [17]:
def openfile(filename, mode):
    try:
        f = open(filename, mode)
    except:
        print 'Error:', filename, 'does not exist'
        
openfile('nonexistent.txt', 'r')

Error: nonexistent.txt does not exist


<p><a name="built"></a></p>
## Built-in Exceptions

The previous except clause - with no specific exception named - catches all exceptions. However, it is best to be specific about what exceptions you want to catch, so that you won’t respond inappropriately. 

For example, the problem with the code below is that we specify a mode that does not exist, but the error message we print is not true: `existent.txt` does exist.

In [18]:
def openfile(filename, mode):
    try:
        f = open(filename, mode)
    except:
        print 'Error:', filename, 'does not exist'
        
openfile('existent.txt', 'no_such_mode')

Error: existent.txt does not exist


There are many different exceptions. Here are some of the most common:

- Exception:  the most general exception.
- TypeError:  the error when you give the wrong type to a function, e.g. `3 + []`
- ValueError:  the exception when you give a bad value (of the correct type), e.g. `int('abc')`
- IndexError:  when your list subscript is out of bounds, e.g. `[][0]`
- IOError:  when you try to open a non-existent file.

The complete list is here: https://docs.python.org/2/library/exceptions.html

We can see the type of an error as below:

In [19]:
def openfile(filename, mode):
    try:
        f = open(filename, mode)
    except Exception as e:
        print type(e)
        
openfile('nonexistent.txt', 'r')
openfile('existent.txt', 'no_such_mode')

<type 'exceptions.IOError'>
<type 'exceptions.ValueError'>
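
The same idea extends to the other exceptions listed above. The sketch below triggers each one and records the name of its type; the helper `exception_name` and the file path are hypothetical. Under Python 3 the type reported for a missing file is `FileNotFoundError`, a subclass of `IOError`, so the sketch catches `IOError` rather than comparing names:

```python
import os
import tempfile

def exception_name(action):
    """Run action() and return the name of the exception it raises, if any."""
    try:
        action()
    except Exception as e:
        return type(e).__name__
    return None

names = [
    exception_name(lambda: 3 + []),        # TypeError
    exception_name(lambda: int('abc')),    # ValueError
    exception_name(lambda: [][0]),         # IndexError
]

# A path that cannot exist: a made-up file name inside a fresh, empty directory
tmp_dir = tempfile.mkdtemp()
missing = os.path.join(tmp_dir, 'nonexistent.txt')

try:
    open(missing, 'r')
    is_io_error = False
except IOError:
    is_io_error = True    # catching IOError works in Python 2 and Python 3

os.rmdir(tmp_dir)         # clean up the temporary directory
```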


To deal with different errors differently, we can use multiple `except` clauses.

The general form of the try statement, and the meaning of the various parts, is:

```
try:
    statements			# start by executing these
except name:
    statements			# execute if exception “name” was raised
...
except:
    statements			# execute if an exception was raised that is not named above
else:
    statements			# execute if no exception was raised
finally:
    statements			# execute no matter what
```

For example:

In [20]:
def openfile(filename, mode):
    try:
        f = open(filename, mode)
    except IOError:
        print 'File doesn\'t exist in this case.'
    except ValueError:
        print 'Likely to be wrong mode in this case.'
    except:
        print 'Some other error.'
    else:
        print 'No error'
    finally:
        print 'Everybody should have this!'

We test the code below:

In [21]:
print "openfile('nonexistent.txt', 'r')"
print '-'*50
openfile('nonexistent.txt', 'r')
print '\n'

print "openfile('existent.txt', 'no_such_mode')"
print '-'*50
openfile('existent.txt', 'no_such_mode')
print '\n'

print "openfile('existent.txt', 123)"
print '-'*50
openfile('existent.txt', 123)
print '\n'

print "openfile('existent.txt', 'r')"
print '-'*50
openfile('existent.txt', 'r')

openfile('nonexistent.txt', 'r')
--------------------------------------------------
File doesn't exist in this case.
Everybody should have this!


openfile('existent.txt', 'no_such_mode')
--------------------------------------------------
Likely to be wrong mode in this case.
Everybody should have this!


openfile('existent.txt', 123)
--------------------------------------------------
Some other error.
Everybody should have this!


openfile('existent.txt', 'r')
--------------------------------------------------
No error
Everybody should have this!


Exceptions can carry more information than just their type; they have attributes giving information specific to the error. In the examples above we came up with our own error messages; we may use the attributes instead. We see that the function `open` actually raises exceptions that carry detailed information:

In [23]:
def openfile(filename, mode):
    try:
        f = open(filename, mode)
    except Exception as e:
        print 'Error: ', e.args
        
print "openfile('nonexistent.txt', 'r')"
print '-'*50
openfile('nonexistent.txt', 'r')
print '\n'

print "openfile('existent.txt', 'no_such_mode')"
print '-'*50
openfile('existent.txt', 'no_such_mode')
print '\n'

print "openfile('existent.txt', 123)"
print '-'*50
openfile('existent.txt', 123)
print '\n'

print "openfile('existent.txt', 'r')"
print '-'*50
openfile('existent.txt', 'r')

openfile('nonexistent.txt', 'r')
--------------------------------------------------
Error:  (2, 'No such file or directory')


openfile('existent.txt', 'no_such_mode')
--------------------------------------------------
Error:  ("mode string must begin with one of 'r', 'w', 'a' or 'U', not 'no_such_mode'",)


openfile('existent.txt', 123)
--------------------------------------------------
Error:  ('file() argument 2 must be string, not int',)


openfile('existent.txt', 'r')
--------------------------------------------------


<p><a name="handle"></a></p>
## Handling Exceptions

Dealing with errors (rather than just printing an error message) can be complicated. One needs to consider all possible scenarios.

The exception mechanism lets a function signal an error condition and have it propagated to some earlier caller without the intervening functions having to know about it. Here is an example. The code below lets the user enter a number and then computes the square root of its reciprocal.

In [24]:
def f():
    x = raw_input('Enter a number: ')
    x = float(x)
    return reciprocal(x)
        
def reciprocal(x):
    x = 1/x
    return newton_sqrt(x)

def newton_sqrt(x, error = 1e-6):
    y = x/2 + 1
    while abs(y**2-x)>error:
        y = (y + (x/y))/2
        
    return y

f()

Enter a number: 2


0.7071067829248019

There are some possible errors that we need to take care of. For example, we might encounter a `ZeroDivisionError` in `reciprocal`:

In [25]:
f()

Enter a number: 0


ZeroDivisionError: float division by zero

We now take care of this error. Notice that even though the error occurs in `reciprocal`, we can handle it in `f`. This is an example where functions are built on top of one another and we only need to handle errors at the "highest" level.

Below, we keep asking the user for input until they enter a valid one.

In [26]:
def f():
    while True:
        x = raw_input('Enter a number: ')
        x = float(x)
        try:
            return reciprocal(x)   # returning also ends the loop
        except Exception as e:
            pass                   # ignore the error and ask again
        
def reciprocal(x):
    x = 1/x
    return newton_sqrt(x)

def newton_sqrt(x, error = 1e-6):
    y = x/2 + 1
    while abs(y**2-x)>error:
        y = (y + (x/y))/2
        
    return y

f()

Enter a number: 0
Enter a number: 1


1.0000000000131073

We may also raise exceptions ourselves. For example, a negative number has no real square root, so we can `raise` our own `ValueError` as below.

**Note**
- Why is it a `ValueError`?
- An exception can be created with `raise`, and no matter how deep in the call chain it is raised, it can be handled at the highest level.
- Since we are now dealing with different kinds of exceptions, we add a `print` statement to indicate which exception actually occurred.

In [27]:
def f():
    while True:
        x = raw_input('Enter a number: ')
        x = float(x)
        try:
            return reciprocal(x)
        except Exception as e:
            print e.args[0]  ### We add this line now that we are dealing 
                             ### with different kinds of exceptions
        
def reciprocal(x):
    x = 1/x
    return newton_sqrt(x)

def newton_sqrt(x, error = 1e-6):
    if x < 0:
        raise ValueError('Please enter a non-negative number.')
    y = x/2 + 1
    while abs(y**2-x)>error:
        y = (y + (x/y))/2
        
    return y

f()

Enter a number: -1
Please enter a non-negative number.
Enter a number: 1


1.0000000000131073
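
To watch the propagation without typing at a prompt, here is a minimal sketch in which the candidate inputs come from a list instead of `raw_input`. The function `first_valid` and the list passed to it are hypothetical names introduced for this illustration; `reciprocal` and `newton_sqrt` repeat the definitions above:

```python
def newton_sqrt(x, error=1e-6):
    if x < 0:
        raise ValueError('Please enter a non-negative number.')
    y = x/2 + 1
    while abs(y**2 - x) > error:
        y = (y + (x/y))/2
    return y

def reciprocal(x):
    x = 1/x                      # raises ZeroDivisionError when x == 0
    return newton_sqrt(x)        # raises ValueError when x < 0

def first_valid(candidates):
    """Return the result for the first candidate that raises no exception."""
    seen = []                    # names of the exceptions caught on the way
    for x in candidates:
        try:
            return reciprocal(x), seen
        except Exception as e:
            # Both errors are raised in deeper calls but handled here
            seen.append(type(e).__name__)
    return None, seen

result, seen = first_valid([0.0, -1.0, 4.0])
# seen is ['ZeroDivisionError', 'ValueError']; result is close to 0.5
```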

**Exercise 4**

- Call the function `f()` and enter a letter `a`. What kind of error is raised?
- Modify our code so that it asks users to input again when encountering this error.

In [28]:
#### Your code here

f()

Enter a number: a


ValueError: could not convert string to float: a

In [29]:
#### Your code here

Enter a number: 0
float division by zero
Enter a number: -1
Please enter a non-negative number.
Enter a number: a
could not convert string to float: a
Enter a number: 4


0.5000002293118679

<p><a name="class"></a></p>
# Classes

- Classes are a way of organizing code. The idea is common to virtually all programming languages designed in the past thirty years, including Ruby, Java, C++, JavaScript, Perl, Scala, etc.
- Classes are closely tied to objects.  A class is a syntactic construct that acts as a template for objects.
 - We first write a class.
 - Then we create objects according to the template provided by that class.  These are called “objects” or “instances” of the class.
- Understanding what classes are, when to use them, and how to use them can be useful. In the process, we'll learn the meaning of the term Object-Oriented Programming.

**Objects**
- “Everything is an object”
- In Python, every value - integer, string, list, tuple, whatever - is an object.
- By defining classes, you can in effect define your own type of data.  You can even define infix operators (like +) in your class.
- You can find the class of which an object is an instance by using the type function:

In [30]:
s = set([1,2,3])
type(s)

set

In [31]:
type(3)

int

In [32]:
type({1:2})

dict

- An object is a collection of values together with functions that can access those values.
 - The values have names, and are called fields, or attributes.
 - The functions are called methods.
- Together, the values represent some object and the methods are the operations you can perform on those objects.
 - For example, a ComplexNumber object would be represented by two numbers (the real and imaginary parts) and would have operations like plus and times.
 - A Library object would be represented by two lists: all the books it has, and the ones that are checked out.  The operations would include check_out_book(book) and return_book(book).
 
**Representing a Dictionary**
- Python has a built-in dictionary, but we’ll pretend it doesn’t.  Again, the important thing is the operations we want our dictionary to support:
 - lookup(key) returns the value associated with a key
 - add(key, value) associates value with key (replacing whatever value key might have been associated with before)
 - contains(key) says whether key has an associated value
- We want to use our new dictionary the same as the built-in dictionary (except for syntax):
```
d = Dictionary()
d.add('ny', 'albany')
d.add('nj', 'trenton')
d.lookup('ny') ---> albany
```

**Exercise 5**

- For this first exercise on classes, we won’t actually create a class, but will just create the functions needed to represent a dictionary object.  In the next class, we’ll create a class and turn these functions into methods.

- Define the Dictionary functions:
 - Dictionary():  Return an empty list.
 - add(d, k, v):  Modify d (list of pairs) so that (k, v) becomes the first pair.  This is a mutating operation.  (Remember insert on lists.)
 - lookup(d, k):  Use filter to look up k.  Don’t worry about whether k is in d.
 - contains(d, k):  Use filter to determine if k is defined in d.
- These should work as above, except that they don’t use object-oriented syntax:

```
d = Dictionary()
add(d, 'ny', 'albany')
add(d, 'nj', 'trenton')

lookup(d, 'ny') ---> 'albany'
contains(d, 'nj') ---> True
contains(d, 'nm') ---> False
```

In [2]:
#### Your code here

'albany'

<p><a name="attri"></a></p>
## Attributes and Methods

We’ll go into details in a bit, but here is the syntax to define a simple class:

```
class classname(object):
    def __init__(self):
        initialize representation by assigning to variables

    def methodname(self, ...args...):
        define method; change representation or
                  return value or both; use self.var to refer to
                  variable var defined in init.
```
In `__init__`, we assign the desired representation to one or more variables, so we just have to decide what their names will be.

Once we have this class, we can create instances of it:

```
newobj = classname()
```

We invoke methods using object notation:

```
newobj.methodname(...args...)
```
Note that even though we defined the method using ordinary function definition syntax, we call it using object syntax.  That is just because it is defined inside a class.
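
As a concrete, minimal instance of this template, here is a hypothetical class `Tally` (not part of the exercises) with one attribute and two methods:

```python
class Tally(object):
    def __init__(self):
        # the representation: a single attribute holding the current count
        self.count = 0

    def increment(self, step):
        # change the representation
        self.count = self.count + step

    def value(self):
        # return a value computed from the representation
        return self.count

t = Tally()          # create an instance; __init__ runs automatically
t.increment(2)       # object syntax: t is passed to the method as self
t.increment(3)
t.value()            # 5
```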

**Exercise 6**

- We will turn our dictionary functions into a class called Dictionary.
- We’ll get you started with the definition of `__init__`:
```
class Dictionary(object):
   def __init__(self):
      self.kv_pairs = []
```
- Define `add`, `lookup`, and `contains`.  These are identical to the definitions you gave before, except: The first argument should be named `self`; the list of pairs should be referred to as `self.kv_pairs`.

- Perform the tasks below with your new class, using object syntax:
```
d = Dictionary()
d.add('ny', 'albany')
d.add('nj', 'trenton')
d.lookup('ny')
d.contains('nj')
d.contains('nm')
```

In [34]:
#### Your code here

'albany'

- We’re now going to go into more detail.  As our example, we’ll define a class Vector representing vectors in an n-dimensional space.
- We will start out with simple operations:  initialize a vector with a list of numbers; calculate the length of the vector in Euclidean space.
```
vec_1 = Vector([1,2,3])
vec_1.length() ---> 3.74165738677
```

- After that we will introduce ways to print elements, add two vectors, and other operations.

Here is how to initialize an object with an argument:

In [35]:
class Vector(object):
    def __init__(self, lis):
        self.coords = lis

`__init__()` always takes at least one argument, `self`, that refers to the object being created.  Variables of the form `self.name` constitute the attributes of the object, i.e. its representation. In this case, the representation is a list, self.coords.

When creating an instance of the class `Vector`, as in `Vector([1,2,3])` above, the `__init__()` method is invoked. It initializes the `coords` attribute of that instance. 

We can add methods to classes. For example, if we want to calculate the length of a vector, we can add:

In [36]:
class Vector(object):
    def __init__(self, lis):
        self.coords = lis

    def length(self):
        return sum([x**2 for x in self.coords])**.5

When length is invoked as “`v.length()`”, the instance `v` becomes the parameter self.  Then self.coords is used to refer to the coords attribute of the instance.
As noted earlier, we now can create an instance of `Vector` and access its method using dot notation.  We can also look at its attribute:

In [37]:
vec_1 = Vector([1,2,3])
print vec_1.coords
print vec_1.length()

[1, 2, 3]
3.74165738677


However it is poor style to make the attributes visible outside the class definition.  The problem is this:  Users of your class - “clients” - will come to depend upon these attributes.  If you want to change the representation of objects, you cannot do it because your clients’ code will break.  This may not sound like a big deal, but over time it is a very serious problem.

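Python uses the prefix ‘`__`’  (two underscores) to hide attributes and methods from direct access outside an object. (Internally the name is mangled to `_Vector__coords`, so this discourages outside access rather than strictly preventing it.)
Now prefix `coords` with `__` and try to access it: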

In [38]:
class Vector(object):
    def __init__(self, lis):
        self.__coords = lis


    def length(self):
        return sum([x**2 for x in self.__coords])**.5


v = Vector([1,2,3])
v.__coords 

AttributeError: 'Vector' object has no attribute '__coords'
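
One way to keep such an attribute readable without exposing it directly is to add an accessor method. A minimal sketch; the method name `get_coords` is an assumption made for illustration, and returning a copy is a design choice so that clients cannot modify the internal list:

```python
class Vector(object):
    def __init__(self, lis):
        self.__coords = lis

    def length(self):
        return sum([x**2 for x in self.__coords])**.5

    def get_coords(self):
        # hypothetical accessor: hand out a copy, not the internal list
        return list(self.__coords)

v = Vector([3, 4])
inside = v.get_coords()      # [3, 4]: access through a method works

try:
    v.__coords               # direct access from outside the class fails
    blocked = False
except AttributeError:
    blocked = True
```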

<p><a name="special"></a></p>
## Special Name Method

- In Python, a class can implement certain operations that are invoked by special syntax (such as arithmetic operations or subscripting) by defining methods with special names.
- For example, the `__str__()` method is called by the `str()` built-in function and by the print statement to compute the string representation of an object.  E.g. add this method to `Vector`:

In [39]:
class Vector(object):
    def __init__(self, lis):
        self.coords = lis


    def length(self):
        return sum([x**2 for x in self.coords])**.5
    
    def __str__(self):
        return 'Vector' + str(self.coords)
    
# Then we print the Vector object:
vec_1 = Vector([1,2,3])
print vec_1

Vector[1, 2, 3]


**Emulating numeric types**

- For list objects, ‘+’ means to concatenate two lists.  For the Vector class we just created, we may want to do vector addition by using the expression u + v, where u and v are instances of Vector.
- In Python we can implement the `__add__()` method:

In [40]:
class Vector(object):
    def __init__(self, lis):
        self.coords = lis


    def length(self):
        return sum([x**2 for x in self.coords])**.5
    
    def __str__(self):
        return 'Vector' + str(self.coords)
    
    def __add__(self, other):
        return Vector(map(lambda x, y: x+y, self.coords, other.coords))

Note that this method returns a new Vector object.  It is very common for non-mutating operations to return new objects in this way.

When we add two vector objects with ‘`+`’,`__add__()` is called:

In [41]:
u = Vector([1,2,3])
v = Vector([4,5,6])
w = u + v    # Python actually runs u.__add__(v)
print w

Vector[5, 7, 9]
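
Subscripting and `len()` work the same way. As a sketch, the variant below adds `__getitem__` and `__len__` to the class; these are Python's standard special method names, while the choice to support them on `Vector` is ours. `__add__` is written here with `zip` and a list comprehension so that it also builds a list under Python 3, where `map` returns an iterator:

```python
class Vector(object):
    def __init__(self, lis):
        self.coords = lis

    def __str__(self):
        return 'Vector' + str(self.coords)

    def __add__(self, other):
        # pair up coordinates and add them; returns a new Vector
        return Vector([x + y for x, y in zip(self.coords, other.coords)])

    def __getitem__(self, i):
        # called for v[i]
        return self.coords[i]

    def __len__(self):
        # called for len(v)
        return len(self.coords)

u = Vector([1, 2, 3])
v = Vector([4, 5, 6])
w = u + v
str(w), w[0], len(w)    # ('Vector[5, 7, 9]', 5, 3)
```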


**Exercise 7**

- Now our Vector class looks like this:

```
class Vector(object):
    def __init__(self, lis):
        self.coords = lis

    def length(self):
        return sum([x**2 for x in self.coords])**.5

    def __add__(self, other):
        return Vector(map(lambda x, y: x+y,
                          self.coords, other.coords))

    def __str__(self):
        return 'Vector'+str(self.coords)

```

- Add two more methods to the class:
 - `__eq__(vec)`: returns `True` iff this vector equals `vec`.
   - `u == v` calls `u.__eq__(v)`.
 - `__mul__(vec)`: returns the dot product of this vector and `vec`. The dot product is defined by: (`x`, `y`, …) `*` (`x’`, `y’`, …) = `xx’ + yy’ +` … 
   - `u * v` calls `u.__mul__(v)`.
 
- Then evaluate the following expressions (equality and $\cos(\theta)$):

```
u = Vector([1,1,0])
v = Vector([0,1,1])
print u == v                      
print (u*v) / (u.length()*v.length())
```

In [42]:
#### Your code here

<p><a name="inherence"></a></p>
## Inheritance

- Inheritance is another important feature of object-oriented programming.
- With inheritance, a class can be a “child” of another class, and inherit the attributes and methods of that class.
 - The “parent” is called the superclass or base class.
 - The “child” is the subclass or derived class.
- Every class has a superclass; this is the name given in parentheses in the class definition:
```
class Vector(object):
```
In Python, you should use object as the base class if you don’t want to inherit from any other class. 

To illustrate, we’ll start with a class called Book, representing a generic book, with a name and an author.  Its only operation is `__str__`.

In [43]:
class Book(object):
    def __init__(self, name, author = None):
        self.name = name
        self.author = author

    def __str__(self):
        return '<%s> by %s' %(self.name, self.author)

We can use inheritance to create classes representing specific types of books, e.g. paper books, ebooks.

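We’ll create a subclass for e-books.  Here `EBook` defines its own `__init__`, which simply passes its arguments on to `Book.__init__`; a subclass that defines no `__init__` inherits the one from its superclass.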

In [44]:
class EBook(Book):
    def __init__(self, name, author = None):
        Book.__init__(self, name, author)

- Several things to point out here:
 - `EBook` is a subclass of `Book`, as shown in the first line.
 - `__init__` has the same arguments as Book’s `__init__`.  (We’ll add specialized methods soon.)
 - `EBook` inherits the attributes of `Book`.  It calls `Book.__init__` to initialize them.

`EBook` inherits the attributes and methods of `Book`.  From the definition above, `EBook` and `Book` do exactly the same things.

In [45]:
book_1 = Book('The little SAS book', None)
ebook_1 = EBook('R CookBook', None)
print book_1
print ebook_1        # inherited method from Book

<The little SAS book> by None
<R CookBook> by None


Note: try defining EBook without calling `Book.__init__()` to see what happens.
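
An alternative to naming the parent class explicitly is the built-in `super`. A minimal sketch, using the two-argument form `super(EBook, self)`, which works for classes that derive from `object` in Python 2 and also in Python 3:

```python
class Book(object):
    def __init__(self, name, author=None):
        self.name = name
        self.author = author

    def __str__(self):
        return '<%s> by %s' % (self.name, self.author)

class EBook(Book):
    def __init__(self, name, author=None):
        # same effect as Book.__init__(self, name, author)
        super(EBook, self).__init__(name, author)

ebook = EBook('R CookBook')
str(ebook)    # '<R CookBook> by None'
```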

**Class Inheritance is an “is a” relationship**
- The key point about inheritance is this:  A subclass can be used wherever its superclass could be used.  That’s because the subclass has all the methods of the superclass, so any client using the superclass can also use the subclass.
- We say that an EBook “is a” Book, or, more generally, any object of a subclass is also an object of the superclass.  The way to think about inheritance is that derived classes define specialized instances of the base class.
- There are several operations designed to let you understand the types of objects and the subclass relationships:
 - type:  This will give the actual type of an object.

In [46]:
type(book_1)

__main__.Book

In [47]:
type(ebook_1)

__main__.EBook

- `issubclass`:  Check the relationship between two classes:

In [48]:
issubclass(EBook, Book)

True

In [49]:
issubclass(Book, EBook)

False

- `isinstance`:  Checks if an object is an object of a class or of any of its subclasses.

In [50]:
print isinstance(ebook_1, Book)

True


In [51]:
print isinstance(ebook_1, EBook)

True


In [52]:
print isinstance(book_1, Book)

True


In [53]:
print isinstance(book_1, EBook)

False


The first check above is the most interesting: `ebook_1` is considered an instance of `Book` even though it is actually an `EBook` object.

Note that subclasses can have subclasses. `isinstance(object, class)` returns `True` as long as `object` is an instance of `class` or of any descendant of `class`.
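
A short sketch of a two-level hierarchy. The class `KindleBook` is hypothetical, introduced only to give `EBook` a subclass; `Book` and `EBook` are reduced to the essentials:

```python
class Book(object):
    def __init__(self, name, author=None):
        self.name = name
        self.author = author

class EBook(Book):
    pass            # inherits __init__ from Book

class KindleBook(EBook):
    pass            # hypothetical subclass of a subclass

kb = KindleBook('R CookBook')

checks = [
    isinstance(kb, KindleBook),     # True: its actual class
    isinstance(kb, EBook),          # True: parent class
    isinstance(kb, Book),           # True: grandparent class
    issubclass(KindleBook, Book),   # True: the relationship is transitive
    isinstance(Book('x'), EBook),   # False: a Book is not an EBook
]
```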

Derived classes can have their own attributes.  We can add a new attribute - format, which can be ‘pdf’, ‘kindle’, etc. - to EBook:

In [54]:
class EBook(Book):
    def __init__(self, name, fmt, author = None):
        Book.__init__(self, name, author)
        self.fmt = fmt

    def get_fmt(self):
        return self.fmt


ebook_2 = EBook('R CookBook', 'pdf')
print ebook_2.get_fmt()

pdf


Book already provides an `__str__()` method, but if we want EBook to do something different - say, to print the format as well - we can rewrite the `__str__()` method. This is called method overriding.

In [55]:
class EBook(Book):
    def __init__(self, name, fmt, author = None):
        Book.__init__(self, name, author)
        self.fmt = fmt
    def __str__(self):      # override __str__() method
        return Book.__str__(self) + ', format: '+ self.fmt

When we call `__str__()`, it invokes the method from EBook:

In [56]:
ebook_3 = EBook('R CookBook', 'pdf')
print ebook_3

<R CookBook> by None, format: pdf


Objects can be changed (mutated) simply by assigning to their attributes.  Here we allow for the title of a book to be changed:

In [57]:
class Book(object):
    def __init__(self, name, author = None):
        self.name = name
        self.author = author
    def __str__(self):
        return '<%s> by %s' %(self.name, self.author)
    def rename(self, newname):
        self.name = newname

book_1 = Book('The little SAS book', None)
print "The name of book_1 is originally %s" % book_1
book_1.rename('The SAS book')
print "The name of book_1 is now %s" % book_1

The name of book_1 is originally <The little SAS book> by None
The name of book_1 is now <The SAS book> by None


**Exercise 8**

- Add a new attribute and two methods to EBook:
  - size is the number of bytes in the EBook.  This should be added to `__init__` as an argument with default 0, and should be included in the string representation.
  - `get_size()` returns the size.
  - `compress()` divides the size in half.


In [58]:
#### Your code here

<p><a name="pandas"></a></p>
# Intro to Pandas

Pandas is a popular package defining several new data types, plus a variety of convenience functions for data manipulation. The most important data type is `DataFrame`, which is inspired by the type of the same name in R, a programming language popular among statisticians and data scientists. We first import the module with the `import` keyword.

In [36]:
import pandas as pd

We then create a data frame with the function `DataFrame` from pandas. Note that 

1. The function name follows the module name, as in `pd.DataFrame`. The `as` keyword in the import lets us use the abbreviation `pd` instead of `pandas`.
2. Data can be given as a nested list. Each inner list represents a row.
3. **Column names** can be specified within the `DataFrame` function.
4. By default, pandas uses the integers from 0 to the number of rows minus 1 as the **index**.

In [37]:
nested_lst = [[1,2,3,4,'hello'],[5,6,7,8,'world'],[9,10,11,12,'foo']]
my_df = pd.DataFrame(nested_lst, columns = ['a','b','c','d','message'])
my_df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
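
The points in the list above can be checked directly on the resulting object. This is a minimal sketch assuming pandas is installed; `rows` and `small_df` are hypothetical names used only here.

```python
import pandas as pd

rows = [[1, 2, 'hello'], [5, 6, 'world']]       # each inner list is one row
small_df = pd.DataFrame(rows, columns=['a', 'b', 'message'])

column_names = list(small_df.columns)   # the names passed via `columns`
index_values = list(small_df.index)     # default index: 0, 1, ..., n_rows - 1
print(column_names)
print(index_values)
```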


<p><a name="IO"></a></p>
## I/O of Data Frame

After we process the data in a data frame, we often want to save the result back to a csv file. We may do that with the `to_csv` method and its default settings.

- Note that the index is written into the first column of the csv file.

In [38]:
my_df.to_csv('my_df.csv')
!cat my_df.csv

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


- Having the index in a column of a csv file can cause trouble, because pandas assumes whatever it receives is data. The index in the csv file becomes an unnamed column in the data frame, and pandas then assigns a new index to the data frame.

In [39]:
pd.read_csv('my_df.csv')

Unnamed: 0.1,Unnamed: 0,a,b,c,d,message
0,0,1,2,3,4,hello
1,1,5,6,7,8,world
2,2,9,10,11,12,foo


- To avoid that, we can "tell" pandas which column in the csv file is actually the index, here by its integer position.

In [40]:
pd.read_csv('my_df.csv', index_col=0)

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
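
The round trip can also be reproduced without touching the disk. The sketch below assumes Python 3 (where `io.StringIO` accepts ordinary strings) and uses a small hypothetical frame `df`; calling `to_csv()` without a path returns the csv content as a string.

```python
import io
import pandas as pd

df = pd.DataFrame([[1, 'hello'], [5, 'world']], columns=['a', 'message'])
csv_text = df.to_csv()      # no path given: the csv content is returned as a string

naive = pd.read_csv(io.StringIO(csv_text))                # index read back as data
fixed = pd.read_csv(io.StringIO(csv_text), index_col=0)   # first column is the index

print(list(naive.columns))
print(list(fixed.columns))
```

With `index_col=0` the frame read back has the same shape as the one that was written.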


When writing a data frame into a csv file, the index doesn't need to be included:

In [41]:
my_df.to_csv('my_df.csv', index=False)
!cat my_df.csv

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


- The default setting of `pd.read_csv` is designed to take care of this format of data.

In [42]:
pd.read_csv('my_df.csv')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


- The index column can also be indicated by its column name.

In [43]:
pd.read_csv('my_df.csv', index_col='a')

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


- By default the first row in the csv file is taken as the column names. If we specify `header=None`, the first row is treated as the first row of data.

In [44]:
pd.read_csv('my_df.csv', header=None)

Unnamed: 0,0,1,2,3,4
0,a,b,c,d,message
1,1,2,3,4,hello
2,5,6,7,8,world
3,9,10,11,12,foo


- If we don't even want the first row in csv to be included, we may use `skiprows` as below:

In [45]:
pd.read_csv('my_df.csv', header=None, skiprows=[0])

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
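
If the file's own header is discarded, replacement column names can be supplied at read time. This is a minimal sketch assuming Python 3 and an in-memory buffer in place of `my_df.csv`; the `names` parameter of `read_csv` is not used in the original notebook, and the labels in `new_names` are hypothetical.

```python
import io
import pandas as pd

csv_text = "a,b,message\n1,2,hello\n5,6,world\n"
new_names = ['x', 'y', 'text']          # hypothetical replacement labels

renamed = pd.read_csv(io.StringIO(csv_text),
                      header=None,      # do not use any row as the header
                      skiprows=[0],     # skip the original header row
                      names=new_names)  # supply our own column names
print(list(renamed.columns))
print(renamed.shape)
```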


**Non comma separated**

A little more work needs to be done in case the values are not separated by commas. For example, the `my_df.txt` file is not a csv file:

In [46]:
!cat my_df.txt

a	b	c	d	message
1	2	3	4	hello
5	6	7	8	world
9	10	11	12	foo


- Since values are separated by tab (`'\t'`), the default parameters of `read_csv` don't work as expected:

In [47]:
pd.read_csv('my_df.txt')

Unnamed: 0,a	b	c	d	message
0,1\t2\t3\t4\thello
1,5\t6\t7\t8\tworld
2,9\t10\t11\t12\tfoo


- We may fix this problem by specifying `sep='\t'` to set the delimiter.

In [48]:
pd.read_csv('my_df.txt', sep='\t')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


- or we may simply use the `read_table` function:

In [49]:
pd.read_table('my_df.txt')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
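
The effect of the delimiter is easy to see from the shape of the result. The sketch below assumes Python 3 and replaces `my_df.txt` with a small in-memory string, `tsv_text`, which is a made-up stand-in for the file.

```python
import io
import pandas as pd

tsv_text = "a\tb\tmessage\n1\t2\thello\n5\t6\tworld\n"

wrong = pd.read_csv(io.StringIO(tsv_text))             # comma expected: one wide column
right = pd.read_csv(io.StringIO(tsv_text), sep='\t')   # tab as the delimiter

print(wrong.shape)
print(right.shape)
```

With the default delimiter every line is read as a single field, so the frame has only one column.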


<p><a name="index"></a></p>
## Index and Column Names

We have shown how the index and the column names can be dealt with when importing the data. We may also set them after the data is loaded. Below we load the data frame by `read_csv` with its default parameters:

In [50]:
A = pd.read_csv('my_df.csv')
A

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


- We may assign any column names, say `[0, 1, 2, 3, 4]`, to the data frame.

In [51]:
A.columns = range(5)
A

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


- Index can be assigned in a similar way.

In [52]:
A.index =['a','b','c']
A

Unnamed: 0,0,1,2,3,4
a,1,2,3,4,hello
b,5,6,7,8,world
c,9,10,11,12,foo


The number of rows (the length of the index) and the number of columns are recorded in the `shape` attribute, which is a tuple of two integers. The first one is the number of rows; the second is the number of columns. "First row; second column" is a widely adopted convention.

In [53]:
A.shape

(3, 5)
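
Since `shape` is an ordinary tuple, it can be unpacked into two variables. A minimal sketch on a hypothetical two-row frame `B` (the new labels are made up for illustration):

```python
import pandas as pd

B = pd.DataFrame([[1, 2, 'hello'], [5, 6, 'world']], columns=['a', 'b', 'message'])
B.columns = ['x', 'y', 'text']     # one entry per column
B.index = ['r1', 'r2']             # one entry per row

n_rows, n_cols = B.shape           # rows first, columns second
print(B.shape)
```

The lists assigned to `columns` and `index` have to match the number of columns and rows respectively.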

<p><a name="select"></a></p>
## Selection and Filtering

We often want to focus on a small portion of the data frame. We first load the famous iris data set into a data frame. This data set consists of 5 columns:

- The first four columns are numeric features of the iris flowers: sepal length, sepal width, petal length and petal width.
- The last column is the species of each observation, which is one of setosa, versicolor and virginica.
- We assign the data frame to the variable `iris`.

In [54]:
iris = pd.read_csv('iris.csv')

We very often want to have a quick glance at the data frame. The methods `head` and `tail` come in handy.

In [55]:
iris.head()  # head returns the first 5 rows

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [56]:
iris.tail() # tail returns the last 5 rows

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


<p><a name="single"></a></p>
### Selecting a Single Value

Selection can be done with integer positions or with the names of the rows and columns. Unfortunately, the two ways use different tools: `iloc` and `loc`.

**`iloc`**

Selection with integers needs to be done with `iloc`. `iloc` takes two integers inside square brackets: the first indicates the row position and the second the column position. For example, if we want to select the intersection of **the fourth row** (the row indexed by 3) and **the third column** (the column indexed by 2):

In [57]:
iris.iloc[3,2]

1.5

**`loc`**

Selection with names needs to be done with `loc`. Here it seems we use an integer for the row, but that is because the row names happen to be integers.

In [58]:
iris.loc[3, 'Petal.Length']

1.5

- We saw again the convention "first row; second column".
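
The difference between the two becomes clearer when the row names are not integers. The sketch below uses a hypothetical frame `scores` whose index consists of strings:

```python
import pandas as pd

scores = pd.DataFrame([[90, 85], [70, 65], [50, 45]],
                      index=['ann', 'bob', 'cat'],       # row names are strings
                      columns=['math', 'art'])

by_position = scores.iloc[1, 0]        # second row, first column
by_label = scores.loc['bob', 'math']   # same cell, addressed by names
print(by_position)
print(by_label)
```

Both expressions address the same cell, once by position and once by name.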

<p><a name="singleLine"></a></p>
### Selecting a Single Row or Column

Selecting a column can be done by specifying the column index or column name and passing `:` as the other argument. We use `head` below to save space; otherwise the IPython notebook would print a long series.

In [59]:
iris.iloc[:,0].head()

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: Sepal.Length, dtype: float64

In [60]:
iris.loc[:,'Sepal.Length'].head()

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: Sepal.Length, dtype: float64
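
A single column can also be selected with plain square brackets and the column name, a shorthand that appears later in this section as `iris['Species']`. The following sketch, on a made-up frame `toy`, suggests that all three forms return the same pandas series:

```python
import pandas as pd

toy = pd.DataFrame([[5.1, 'setosa'], [4.9, 'setosa'], [6.3, 'virginica']],
                   columns=['length', 'species'])

col_a = toy.loc[:, 'length']   # label-based column selection
col_b = toy.iloc[:, 0]         # position-based column selection
col_c = toy['length']          # bracket shorthand for a single column

kind = type(col_a).__name__
same = col_a.equals(col_b) and col_a.equals(col_c)
print(kind)
print(same)
```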

**Exercise 9**

- Use both `loc` and `iloc` to select the 5th row.

In [64]:
#### Your code here

Sepal.Length         5
Sepal.Width        3.6
Petal.Length       1.4
Petal.Width        0.2
Species         setosa
Name: 4, dtype: object
------------------------------
Sepal.Length         5
Sepal.Width        3.6
Petal.Length       1.4
Petal.Width        0.2
Species         setosa
Name: 4, dtype: object


<p><a name="multipleLine"></a></p>
### Selecting Multiple Rows or Columns

We first demonstrate how we select multiple columns. 

- With the `iloc` method, `:` comes in handy. `1:3` selects the positions `1` and `2`; the end of the slice, `3`, is excluded.

In [65]:
iris.iloc[:, 1:3].head()

Unnamed: 0,Sepal.Width,Petal.Length
0,3.5,1.4
1,3.0,1.4
2,3.2,1.3
3,3.1,1.5
4,3.6,1.4


- Omitting the left hand side of `:` means selecting all the columns from the first one up to, but not including, the position given on the right hand side of the `:`. For example, `:3` means the first, the second and the third columns.

In [66]:
iris.iloc[:, :3].head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length
0,5.1,3.5,1.4
1,4.9,3.0,1.4
2,4.7,3.2,1.3
3,4.6,3.1,1.5
4,5.0,3.6,1.4


- `loc` can be used to select multiple columns as well. Here we provide the list of the column names.

In [67]:
iris.loc[:, ['Sepal.Length', 'Petal.Length']].head()

Unnamed: 0,Sepal.Length,Petal.Length
0,5.1,1.4
1,4.9,1.4
2,4.7,1.3
3,4.6,1.5
4,5.0,1.4
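
Slices behave differently in the two cases, which matters as soon as rows are selected by range. A minimal sketch on a hypothetical five-row frame `toy` with the default integer index:

```python
import pandas as pd

toy = pd.DataFrame([[1, 2, 3, 4]] * 5, columns=['w', 'x', 'y', 'z'])  # index 0..4

by_position = toy.iloc[1:3, :]          # positions 1 and 2; the end (3) is excluded
by_label = toy.loc[1:3, ['x', 'y']]     # labels 1, 2 and 3; both ends are included

print(by_position.shape)
print(by_label.shape)
```

With `iloc` the slice follows the usual Python convention, while a `loc` slice is made of labels and includes its end point.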


**Exercise 10**

- Select the third to the seventh rows from the iris data set, with both the `loc` and the `iloc` methods.
- Select the sepal width and the species columns from the iris data set.

In [73]:
#### Your code here
print iris.iloc[2:7]   # positions 2..6; the end of an iloc slice is excluded
print '-'*70
print iris.loc[2:6]    # labels 2..6; both ends of a loc slice are included
print '-'*70
print iris.loc[:,['Sepal.Width','Species']]

   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa
6           4.6          3.4           1.4          0.3  setosa
----------------------------------------------------------------------
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa
6           4.6          3.4           1.4          0.3  setosa
----------------------------------------------------------------------
     Sepal.Width    Specie

<p><a name="fancy"></a></p>
### Fancy Indexing

We often want to select according to some criteria. Let's first understand the result of passing Bools to `loc`.

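**Note**: `iloc` accepts a plain list or array of Bools, but not a boolean pandas series such as the ones generated by the comparisons below, so we use `loc` here.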

We have five columns in the iris data frame, so let's create a list of five Bools.

In [106]:
bool_ = [False, False, True, True, False]

We pass `bool_` to the column index of the `loc` method. In this way we select the columns corresponding to where `bool_` has `True`. In this example, it's the third and the fourth columns.

In [107]:
iris.loc[:, bool_].head()

Unnamed: 0,Petal.Length,Petal.Width
0,1.4,0.2
1,1.4,0.2
2,1.3,0.2
3,1.5,0.2
4,1.4,0.2


A list of bools is often generated with comparison operators.

**Note**: The code below generates an object called a pandas series. We don't discuss the details of this object; it is enough to understand that it is list-like and can be treated as a list here.

In [108]:
iris['Species'] == 'setosa'

0       True
1       True
2       True
3       True
4       True
5       True
6       True
7       True
8       True
9       True
10      True
11      True
12      True
13      True
14      True
15      True
16      True
17      True
18      True
19      True
20      True
21      True
22      True
23      True
24      True
25      True
26      True
27      True
28      True
29      True
       ...  
120    False
121    False
122    False
123    False
124    False
125    False
126    False
127    False
128    False
129    False
130    False
131    False
132    False
133    False
134    False
135    False
136    False
137    False
138    False
139    False
140    False
141    False
142    False
143    False
144    False
145    False
146    False
147    False
148    False
149    False
Name: Species, dtype: bool

Passing this list of bools into the first argument in `loc`, we may select the rows whose species is setosa:

In [109]:
iris.loc[ iris['Species']=='setosa', :]

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa
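
The same two steps, building the Bools and then passing them to `loc`, can be followed on a small example. The frame `pets` and its values are made up for illustration:

```python
import pandas as pd

pets = pd.DataFrame([['cat', 4.0], ['dog', 9.5], ['cat', 5.2], ['bird', 0.3]],
                    columns=['kind', 'weight'])

is_cat = pets['kind'] == 'cat'        # boolean series, one entry per row
cats = pets.loc[is_cat, :]            # keep only the rows marked True

print(is_cat.tolist())
print(list(cats.index))
```

Note that the selected rows keep their original index labels; they are not renumbered from 0.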


<p><a name="sort"></a></p>
## Sorting

Sorting is one of the most common tasks. We might want to sort the data frame according to the values, the index or the column names. There are of course multiple ways to achieve these; we introduce one method for each.

**sorting by values**

When sorting by values, we sort the rows according to one column. `sort_values` can be applied:

In [110]:
iris.sort_values(by='Sepal.Length').head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
13,4.3,3.0,1.1,0.1,setosa
42,4.4,3.2,1.3,0.2,setosa
38,4.4,3.0,1.3,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
41,4.5,2.3,1.3,0.3,setosa


**sorting by column names**

Column names are often strings, in which case the columns are sorted in alphabetical order.

**Note**: The `sort_index` method is used because the column names and the index are both considered an "index". The column names are the index of the columns (`axis = 1`), therefore:

In [111]:
iris.sort_index(axis=1).head()

Unnamed: 0,Petal.Length,Petal.Width,Sepal.Length,Sepal.Width,Species
0,1.4,0.2,5.1,3.5,setosa
1,1.4,0.2,4.9,3.0,setosa
2,1.3,0.2,4.7,3.2,setosa
3,1.5,0.2,4.6,3.1,setosa
4,1.4,0.2,5.0,3.6,setosa


**sorting by index**

In our particular example, the index has already been sorted. In case we want to sort the index, the syntax is simply:
```
DataFrame.sort_index(axis=0)
```
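
The three kinds of sorting can be compared side by side on a frame whose index is deliberately out of order. This is a sketch on a hypothetical frame `toy`; the `ascending=False` argument of `sort_values` is not used in the original notebook and is shown only to illustrate descending order.

```python
import pandas as pd

toy = pd.DataFrame([[2, 'c'], [3, 'a'], [1, 'b']],
                   index=[10, 30, 20], columns=['num', 'let'])

by_value = toy.sort_values(by='num', ascending=False)  # largest 'num' first
by_column = toy.sort_index(axis=1)                     # columns in alphabetical order
by_index = toy.sort_index(axis=0)                      # rows in index order

print(list(by_value.index))
print(list(by_column.columns))
print(list(by_index.index))
print(list(toy.index))     # the original frame is left untouched
```

Each call returns a new, sorted data frame, so the result has to be assigned to a variable if it is needed later.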

**Exercise 11**

- Import the iris data frame again:
```
iris = pd.read_csv('iris.csv')
```
- Select the sub data frame whose species are all virginica.
- How many rows have species virginica?
- Sort this sub data frame by petal length.
- What is the 20th percentile of the petal length among all virginica? **Hint**: Let's say the 20th percentile is the observation that is greater than exactly 20% of the observations.

In [82]:
#### Your code here

Sepal.Length          5.8
Sepal.Width           2.8
Petal.Length          5.1
Petal.Width           2.4
Species         virginica
Name: 114, dtype: object

<p><a name="manipulate"></a></p>
## Data Manipulation

We introduce here how we can drop a column and how we can create a new column, either by insertion or by deriving it from existing columns.

**`drop`**

The `drop` method is used for removing either a column (`axis=1`) or a row (`axis=0`).

In [113]:
# removing a column

iris.drop('Sepal.Length', axis=1).head()

Unnamed: 0,Sepal.Width,Petal.Length,Petal.Width,Species
0,3.5,1.4,0.2,setosa
1,3.0,1.4,0.2,setosa
2,3.2,1.3,0.2,setosa
3,3.1,1.5,0.2,setosa
4,3.6,1.4,0.2,setosa


In [114]:
# removing multiple rows

iris.drop([0,1,2], axis=0).head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa


**Insertion**

For insertion, we need to refer to a non-existent column. For example, we don't have any column named `'A'`:

In [115]:
iris.loc[:,'A']

KeyError: 'the label [A] is not in the [columns]'

We can assign a list (with length equal to the number of rows) to this nonexistent column `'A'`. Pandas then inserts it into the data frame as a new column named `'A'`.

**Note**: Assignment is involved, so this is a mutating operation. Nothing is printed by the IPython notebook.

In [116]:
iris.loc[:,'A'] = range(150)

But `iris` is updated.

In [117]:
iris.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,A
0,5.1,3.5,1.4,0.2,setosa,0
1,4.9,3.0,1.4,0.2,setosa,1
2,4.7,3.2,1.3,0.2,setosa,2
3,4.6,3.1,1.5,0.2,setosa,3
4,5.0,3.6,1.4,0.2,setosa,4


**deriving from existent columns**

Quantitative analysis is often desired. The simplest kind is probably **componentwise** computing. For example, we might want to multiply petal length and petal width to obtain petal area.

The usual multiplication among "columns" (without presenting details, these columns are actually pandas series) works as you might expect:

In [118]:
iris.loc[:,'Petal.Length']*iris.loc[:,'Petal.Width']

0       0.28
1       0.28
2       0.26
3       0.30
4       0.28
5       0.68
6       0.42
7       0.30
8       0.28
9       0.15
10      0.30
11      0.32
12      0.14
13      0.11
14      0.24
15      0.60
16      0.52
17      0.42
18      0.51
19      0.45
20      0.34
21      0.60
22      0.20
23      0.85
24      0.38
25      0.32
26      0.64
27      0.30
28      0.28
29      0.32
       ...  
120    13.11
121     9.80
122    13.40
123     8.82
124    11.97
125    10.80
126     8.64
127     8.82
128    11.76
129     9.28
130    11.59
131    12.80
132    12.32
133     7.65
134     7.84
135    14.03
136    13.44
137     9.90
138     8.64
139    11.34
140    13.44
141    11.73
142     9.69
143    13.57
144    14.25
145    11.96
146     9.50
147    10.40
148    12.42
149     9.18
dtype: float64

To record the result into a new column, we simply insert it:

In [119]:
iris.loc[:,'Petal.Area'] = iris.loc[:,'Petal.Length']*iris.loc[:,'Petal.Width']

In [120]:
iris.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,A,Petal.Area
0,5.1,3.5,1.4,0.2,setosa,0,0.28
1,4.9,3.0,1.4,0.2,setosa,1,0.28
2,4.7,3.2,1.3,0.2,setosa,2,0.26
3,4.6,3.1,1.5,0.2,setosa,3,0.3
4,5.0,3.6,1.4,0.2,setosa,4,0.28
