# <img style="float: left; padding-right: 10px; width: 140px" src="lml.png"> LML - Learning Machine Learning 2018  


# Labs 1 and 3: Introduction to Python



**Universidad del Rosario**<br>
**Summer School 2018**<br>
**Main Instructor:** Pavlos Protopapas

---


## Programming Expectations
All assignments for this class will use Python and the browser-based iPython notebook format you are currently viewing. Python experience is not a prerequisite for this course, as long as you are comfortable learning on your own as needed. While we strive to make the programming component of this course straightforward, we won't devote much time to teaching programming or Python syntax. 

We will refer to the Python 3 [documentation](https://docs.python.org/3/) in this lab and throughout the course.  There are also many introductory tutorials to help build programming skills, which are listed in the last section of this lab.

## Table of Contents 
<ol start="0">
<li> Learning Goals </li>
<li> Getting Started</li>
<li> Lists </li>
<li> Strings and Listiness </li>
<li> Dictionaries </li>
<li> Functions </li>
<li> Exceptions, classes and modules </li>
<li> Numpy </li>
<li> I/O Processing </li>
<li> Regular Expressions </li>
<li> Introduction to Pandas </li>
<li> Beautifulsoup </li>
<li> Plotting </li>
<li> References </li>
</ol>

## Part 0:  Learning Goals 
This introductory lab is a condensed tutorial in Python programming.  By the end of this lab, you will feel more comfortable:

* Writing short Python code using functions, loops, arrays, dictionaries, strings, and if statements.
* Manipulating Python lists and recognizing the list properties of other Python containers.
* Learning and reading Python documentation.
* Using `python` modules and libraries.
* Using practical syntax for writing functions in `python`.
* Working proficiently with `pandas`.
* Understanding what regular expressions are.
* Working with regular expressions in `python`.
* Using `Beautiful Soup` to parse `HTML` webpages.
* Reading and writing files.



## Part 1: Getting Started

### Importing modules
All notebooks should begin with code that imports *modules*, collections of built-in, commonly-used Python functions.  Below we import the Numpy module, a fast numerical programming library for scientific computing.  Future labs will require additional modules, which we'll import with the same `import MODULE_NAME as MODULE_NICKNAME` syntax. 

To execute the following cell click anywhere in it and then press CTRL + ENTER or click the "run cell" button at the toolbar. 

In [None]:
import numpy as np #imports a fast numerical programming library and assigns it the nickname "np" for quick reference

Now that Numpy has been imported, we can access some useful functions.  For example, we can use `mean` to calculate the mean of a set of numbers.

In [None]:
np.mean([1.2, 2, 3.3])

The cell above calculates the mean of 1.2, 2, and 3.3.

The code above is not particularly efficient, and efficiency will be important for you when dealing with large data sets. We will see more efficient options later on.

### Calculations and variables

At the most basic level we can use Python as a simple calculator.

In [None]:
1 + 2

Notice integer division (//) and floating-point error below!

In [None]:
1/2, 1//2, 1.0/2.0, 3*3.2
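
To make the floating-point error mentioned above concrete, here is a minimal sketch that compares a sum of decimals using `==` and using `math.isclose` from the standard library. The variable names are chosen only for this illustration.

```python
import math

total = 0.1 + 0.2                       # neither 0.1 nor 0.2 is stored exactly in binary
exact_match = (total == 0.3)            # strict equality is broken by the tiny rounding error
close_match = math.isclose(total, 0.3)  # equal within a small relative tolerance

print(total, exact_match, close_match)
print(7 // 2, 7 / 2)                    # integer division gives 3, true division gives 3.5
```

When comparing floats in your own code, a tolerance-based check like this is usually the safer choice.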

The last line in a cell is returned as the output value, as above.  For cells with multiple lines of results, we can display results using ``print``, as can be seen below.

In [None]:
print(1 + 3.0, "\n", 9, 7)
5/3

We can store integer or floating point values as variables.  The other basic Python data types -- booleans, strings, lists -- can also be stored as variables. 

In [None]:
a = 1
b = 2.0

Here is the storing of a list:

In [None]:
a = [1, 2, 3]

Think of a variable as a label for a value, not a box in which you put the value. 

In [None]:
b = a
b

This DOES NOT create a new copy of `a`. It merely puts a new label on the memory at a, as can be seen by the following code:

In [None]:
print("a:", a)
print("b:", b)
a[1] = 7
print("a after change:", a)
print("b after change:", b)
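
If you do want an independent copy, you have to ask for one explicitly. The sketch below contrasts a second label with a copy made by `list(...)`; the names `values`, `alias`, and `values_copy` are only for this illustration.

```python
values = [1, 2, 3]
alias = values               # a second label on the same list
values_copy = list(values)   # a new list with the same elements (values.copy() also works)

values[1] = 7
print(alias)                 # the alias sees the change
print(values_copy)           # the copy does not
print(alias is values, values_copy is values)
```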

Multiple items on one line in the interface are returned as a *tuple*, an immutable sequence of Python objects. Notice the parentheses (), instead of the [] that refer to a list.

In [None]:
a = 1
b = 2.0
a + a, a - b, b * b, 10*a

We can obtain the type of a variable, and use boolean comparisons to test these types. 

In [None]:
type(a) == float

In [None]:
type(a) == int

>**EXERCISE**:  Create a tuple called `tup` with the following five objects:

> - The first element is an integer of your choice
> - The second element is a float of your choice  
> - The third element is the sum of the first two elements
> - The fourth element is the difference of the first two elements
> - The fifth element is first element divided by the second element

> Display the output of `tup`.  What is the type of the variable `tup`? What happens if you try and change an item in the tuple? 

In [None]:
# your code here


## Part 2: Lists

Much of Python is based on the notion of a list.  In Python, a list is a sequence of items separated by commas, all within square brackets.  The items can be integers, floating points, or another type.  Unlike in C arrays, items in a Python list can be of different types, so Python lists are more versatile than traditional arrays in C or in other languages. 

Let's start out by creating a few lists.  

In [None]:
empty_list = []
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mixed_list = [1, 2., 3, 4., 5]
print(empty_list)
print(int_list)
print(mixed_list, float_list)

Lists in Python are zero-indexed, as in C.  The first entry of the list has index 0, the second has index 1, and so on.

In [None]:
print(int_list[0])
print(float_list[1])

What happens if we try to use an index that doesn't exist for that list?  Python will complain!

In [None]:
print(float_list[10])

A list has a length at any given point in the execution of the code, which we can find using the `len` function.

In [None]:
print(float_list)
len(float_list)

### Indexing on lists

And since Python is zero-indexed, the last element of `float_list` is

In [None]:
float_list[len(float_list)-1]

It is more idiomatic in Python to use -1 for the last element, -2 for the second last, and so on.

In [None]:
float_list[-1]

We can use the ``:`` operator to access a subset of the list.  This is called *slicing.* 

In [None]:
print(float_list[1:5])
print(float_list[0:2])

Below is a summary of list slicing operations:

<img src="images/ops3_v2.png" alt="Drawing" style="width: 600px;"/>

You can slice "backwards" as well:

In [None]:
float_list[:-2] # up to second last

In [None]:
float_list[:4] # up to but not including 5th element

You can also slice with a stride:

In [None]:
float_list[:4:2] # above but skipping every second element

We can iterate through a list using a loop.  Here's a for loop.

In [None]:
for elem in float_list:
    print(elem)

Or, if we like, we can iterate through a list by index, using a for loop with `in range`. This is not idiomatic and is not recommended, but accomplishes the same thing as above.

In [None]:
for i in range(len(float_list)):
    print(float_list[i])

What if you wanted the index as well?

Python has other useful functions such as `enumerate`, which yields one tuple of the form `(index, value)` for each item in the list.

In [None]:
for i, elem in enumerate(float_list):
    print(i,elem)

In [None]:
list(enumerate(float_list))

This is an example of an *iterator*, something that can be used to set up an iteration. When you call `enumerate`, a list of tuples is not created. Rather an object is created, which when iterated over (or when the `list` function is called using it as an argument), acts like you are in a loop, outputting one tuple at a time.

### Appending and deleting

We can also append items to the end of the list using the `+` operator or with `append`.

In [None]:
float_list + [.333]

In [None]:
len(float_list)

In [None]:
float_list.append(.444)

In [None]:
print(float_list)
len(float_list)

Go and run the cell with `float_list.append` a second time.  Then run the next line.  What happens?  

To remove an item from the list, use `del`.

In [None]:
del(float_list[2])
print(float_list)

### List Comprehensions

Lists can be constructed in a compact way using a *list comprehension*.  Here's a simple example.

In [None]:
squaredlist = [i*i for i in int_list]
squaredlist

And here's a more complicated one, requiring a conditional.

In [None]:
comp_list1 = [2*i for i in squaredlist if i % 2 == 0]
print(comp_list1)

This is entirely equivalent to creating `comp_list1` using a loop with a conditional, as below:

In [None]:
comp_list2 = []
for i in squaredlist:
    if i % 2 == 0:
        comp_list2.append(2*i)
        
comp_list2

The list comprehension syntax

```
[expression for item in list if conditional]

```

is equivalent to the syntax

```
for item in list:
    if conditional:
        expression
```

>**EXERCISE**:  Build a list that contains every prime number between 1 and 100, in two different ways:
1.  Using for loops and conditional if statements.
2.  *(Stretch Goal)* Using a list comprehension.  You should be able to do this in one line of code, and it may be helpful to look up the function `all` in the documentation.

In [None]:
# your code here


In [None]:
# your code here


## Part 3:  Strings and listiness

A list is a container that holds a bunch of objects.  We're particularly interested in Python lists because many other containers in Python, like strings, dictionaries, numpy arrays, pandas series and dataframes, and iterators like `enumerate`, have list-like properties.  This is known as [duck](https://en.wikipedia.org/wiki/Duck_typing) typing, a term coined by Alex Martelli, which refers to the notion that  *if it quacks like a duck, it is a duck*.  We'll soon see that these  containers quack like lists, so for practical purposes we can think of these containers as lists!  They are listy!

Containers that are listy have a set length, can be sliced, and can be iterated over with a loop.  Let's look at some listy containers now.

### Strings
We claim that strings are listy.  Here's a string.

In [None]:
astring = "kevin"

Like lists, this string has a set length, the number of characters in the string.

In [None]:
len(astring)

Like lists, we can slice the string.

In [None]:
print(astring[0:2])
print(astring[0:6:2])
print(astring[-1])

And we can iterate through the string with a loop.  Below is a while loop:

In [None]:
i = 0
while i < len(astring):
    print(astring[i])
    i = i + 1

This is equivalent to the for loop:

In [None]:
for character in astring:
    print(character)

So strings are listy.  

How are strings different from lists?  While lists are mutable, strings are immutable.  Note that an error occurs when we try to change the second element of `astring` from `e` to `b`.

In [None]:
print(float_list)
float_list[1] = 2.09
print(float_list)
print(astring)
astring[1] = 'b'
print(astring)

We can't use `append` but we can concatenate with `+`. Why is this?

In [None]:
astring = astring + ', pavlos, ' + 'protopapas'
print(astring)
type(astring)

What is happening here is that we are creating a new string in memory when we do `astring + ', pavlos, ' + 'protopapas'`. Then we are relabelling this string with the old label `astring`. This means that the old memory that `astring` labelled is forgotten. What happens to it? We'll find out later on.
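
The relabelling can be observed directly by keeping a second label on the original string before concatenating. This is a small sketch; the names `word` and `old_label` are only for this illustration.

```python
word = "kevin"
old_label = word              # keep a second label on the original string
word = word + ", pavlos"      # builds a new string and moves the label `word` to it

print(old_label)              # the original string is unchanged
print(word)
print(old_label is word)      # the two labels now point at different objects
```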

To summarize this section, for  practical purposes all containers that are listy have the following properties:

1.  Have a set length, which you can find using `len`
2.  Are iterable (via a loop)
3.  Are sliceable via : operations

We will encounter other listy containers soon.

>**EXERCISE**: Make three strings, called `first`, `middle`, and `last`, with your first, middle, and last names, respectively.  If you don't have a middle name, make up a middle name!  

>Then create a string called `full_name` that joins your first, middle, and last name, with a space separating your first, middle, and last names.  

>Finally make a string called `full_name_rev` which takes `full_name` and reverses the letters.  For example, if `full_name` is `Jane Beth Doe`, then `full_name_rev` is `eoD hteB enaJ`.



In [None]:
list(range(-1, -5,-1))

In [None]:
# your code here


## Part 4: Dictionaries
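A dictionary is another storage container.  Like a list, a dictionary is a collection of items.  Unlike a list, its items are accessed with keys and not integer positions.  (Since Python 3.7 a dictionary remembers the order in which its keys were inserted, but it still cannot be indexed by position.)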

Dictionaries are the closest container we have to a database.

Let's make a dictionary with a few Harvard courses and their corresponding enrollment numbers.

In [None]:
enroll2016_dict = {'CS50': 692, 'CS109 / Stat 121 / AC 209': 312, 'Econ1011a': 95, 'AM21a': 153, 'Stat110': 485}
enroll2016_dict

In [None]:
enroll2016_dict['CS50']

In [None]:
enroll2016_dict.values()

In [None]:
enroll2016_dict.items()

In [None]:
for key, value in enroll2016_dict.items():
    print("%s: %d" %(key, value))

Simply iterating over a dictionary gives us the keys. This is useful when we want to do something with each item:

In [None]:
second_dict={}
for key in enroll2016_dict:
    second_dict[key] = enroll2016_dict[key]
second_dict

The above is an actual copy to another part of memory, unlike `second_dict = enroll2016_dict`, which would have made both variables label the same memory location.

In this example, the keys are strings corresponding to course names.  Keys don't have to be strings though.  

Like lists, you can construct dictionaries using a *dictionary comprehension*, which is similar to a list comprehension. Notice the braces {} and the use of `zip`, which is another iterator that combines two lists together.

In [None]:
my_dict = {k:v for (k, v) in zip(int_list, float_list)}
my_dict

You can also create dictionaries nicely using the *constructor* function `dict`.

In [None]:
dict(a = 1, b = 2)

While dictionaries have some similarity to lists, they are not listy.  They do have a set length, and they can be iterated through with a loop, but they cannot be sliced, since their items are looked up by key rather than by position. In technical terms, dictionaries, lists, and strings are all sized, iterable containers (Python's *Collection* abstraction), but only lists and strings satisfy the *Sequence* protocol, which adds integer indexing and slicing.
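
The following sketch checks these properties on a small dictionary. The data is a shortened, illustrative version of the enrollment dictionary, and the flag `sliced_ok` is introduced only for this example. Which exception a slice raises depends on the Python version, so the sketch catches both.

```python
courses = {'CS50': 692, 'AM21a': 153, 'Stat110': 485}  # illustrative data

print(len(courses))            # dictionaries have a length
for name in courses:           # and can be iterated over (this yields the keys)
    print(name)

sliced_ok = True
try:
    courses[0:2]               # slicing is not supported
except (TypeError, KeyError) as err:
    # TypeError on older Python versions, KeyError on Python 3.12 and later
    sliced_ok = False
    print("Cannot slice a dictionary:", repr(err))
```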

### A cautionary word on iterators (read at home)

Iterators are a bit different from lists in the sense that they can be "exhausted". Perhaps it's best to explain with an example.

In [None]:
an_iterator = enumerate(astring)

In [None]:
type(an_iterator)

In [None]:
for i, c in an_iterator:
    print(i,c)

In [None]:
for i, c in an_iterator:
    print(i,c)

What happens? You get nothing when you run this again! This is because the iterator has been "exhausted", i.e., all its items are used up. I have had answers go wrong for me because I wasn't careful about this. You must either track the state of the iterator or bypass this problem by not storing `enumerate(BLA)` in a variable, so that you don't inadvertently use that variable twice.
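
One simple way around this, sketched below, is to turn the iterator into a list when you need to loop over it more than once. The names `word`, `first_pass`, `second_pass`, and `pairs` are only for this illustration.

```python
word = "kevin"
an_iterator = enumerate(word)

first_pass = list(an_iterator)    # consumes every (index, character) pair
second_pass = list(an_iterator)   # nothing is left: the iterator is exhausted

pairs = list(enumerate(word))     # a list can be looped over as often as needed
print(first_pass)
print(second_pass)
print(pairs == first_pass)
```

The trade-off is memory: the list stores every pair at once, which is fine for small inputs like this one.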

## Part 5: Functions

A *function* is a reusable block of code that does a specific task.  Functions are all over Python, either on their own or on objects.  

We've seen built-in Python functions and methods.  For example, `len` and `print` are built-in Python functions.  And at the beginning of the lab, you called `np.mean` to calculate the mean of three numbers, where `mean` is a function in the numpy module and numpy was abbreviated as `np`. This syntax allows us to have multiple "mean" functions in different modules; calling this one as `np.mean` guarantees that we will pick up numpy's mean function.

### Methods

A function that belongs to an object is called a *method*. An example of this is `append` on an **existing** list. In other words, a *method* is a function on an **instance** of a type of object (also called **class**, here the list type).

In [None]:
print(float_list)
float_list.append(56.7) 
float_list

### User-defined functions

We'll now learn to write our own user-defined functions.  Below is the syntax for defining a basic function with one input argument and one output. You can also define functions with no input or output arguments, or multiple input or output arguments.

```
def name_of_function(arg):
    ...
    return(output)
```

The simplest function has no arguments whatsoever.

In [None]:
def print_greeting():
    print("Hello, welcome to LML 2018!")
    
print_greeting()

We can write functions with one input and one output argument.  Here are two such functions.

In [None]:
def square(x):
    x_sqr = x*x
    return(x_sqr)

def cube(x):
    x_cub = x*x*x
    return(x_cub)

square(5),cube(5)

### Lambda functions

Often we define a mathematical function with a quick one-line function called a *lambda*. No return statement is needed.

The big use of lambda functions in data science is for mathematical functions.

In [None]:
square = lambda x: x*x
print(square(3))


# the hypotenuse is the square root of the sum of the squared sides
hypotenuse = lambda x, y: (x*x + y*y)**0.5

## Same as

# def hypotenuse(x, y):
#     return((x*x + y*y)**0.5)

hypotenuse(3,4)

### Refactoring using functions

>**EXERCISE**: Write a function called `isprime` that takes in a positive integer $N$, and determines whether or not it is prime.  Return the $N$ if it's prime and return nothing if it isn't.  You may want to reuse part of your code from the exercise in Part 2.  

> Then, using a list comprehension and `isprime`, create a list `myprimes` that contains all the prime numbers less than 100.  

In [None]:
# your code here


Notice that what you just did is a **refactoring** of the algorithm you used earlier to find primes smaller than 100. This implementation reads much cleaner, and the function `isprime`, which contains the "kernel" of the functionality of the algorithm, can be **re-used** in other places. You should endeavor to write code like this.

### Default Arguments

Functions may also have *default* argument values.  Functions with default values are used extensively in many libraries.  

In [None]:
# This function can be called with x and y, in which case it will return x*y;
# or it can be called with x only, in which case it will return x*1.
def get_multiple(x, y = 1):
    return x*y

print("With x and y:", get_multiple(10, 2))
print("With x only:", get_multiple(10))

We can have multiple default values. 

In [None]:
def print_special_greeting(name, leaving = False, condition = "nice"):
    print("Hi", name)
    print("How are you doing on this", condition, "day?")
    if leaving:
        print("Please come back! ")

# Use all the default arguments.
print_special_greeting("Pavlos")

Or change all the default arguments:

In [None]:
print_special_greeting("Pavlos", True, "rainy")

Or use the first default argument but change the second one.

In [None]:
print_special_greeting("Pavlos", condition="horrible")

### Positional and keyword arguments 

These allow for even more flexibility.  

*Positional* arguments are used when you don't know how many input arguments your function will be given.  Notice the single asterisk before the second argument name.

In [None]:
def print_siblings(name, *siblings):
    print(name, "has the following siblings:")
    for sib in siblings:
        print(sib)
    print()
print_siblings("John", "Ashley", "Lauren", "Arthur")
print_siblings("Mike", "John")
print_siblings("Terry")        

In the function above, arguments after the first input will go into a tuple called `siblings`. We can then process that tuple to extract the names.

*Keyword* arguments mix the named argument and positional properties.  Notice the double asterisks before the second argument name.

In [None]:
def print_brothers_sisters(name, **siblings):
    print(name, "has the following siblings:")
    for sib in siblings:
        print(sib, ":", siblings[sib])
    print()
    
print_brothers_sisters("John", Ashley="sister", Lauren="sister", Arthur="brother")

### Putting things together

Finally, when putting all those things together one must follow a certain order. Below is a more general function definition.  The ordering of the inputs is key: arguments, default, positional, keyword arguments.
```
def name_of_function(arg1, arg2, opt1=True, opt2="CS109", *args, **kwargs):
    ...
    return(output1, output2, ...)
```

Positional arguments are stored in a tuple, and keyword arguments in a dictionary.

In [None]:
def f(a, b, c=5, *tupleargs, **dictargs):
    print("got", a, b, c, tupleargs, dictargs)
    return a
print(f(1,3))
print(f(1, 3, c=4, d=1, e=3))
print(f(1, 3, 9, 11, d=1, e=3)) # try calling with c = 9 to see what happens!

### Functions are first class

Python functions are *first class*, meaning that we can pass functions to other functions, built-in or user-defined. 

In [None]:
def sum_of_anything(x, y, f):
    print(x, y, f)
    return(f(x) + f(y))
sum_of_anything(3,4,square)
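
Because functions are ordinary values, the third argument can be a named function, a lambda, or a built-in. The sketch below redefines `sum_of_anything` and `square` (without the debugging print) so that it runs on its own; the result names are only for this illustration.

```python
def sum_of_anything(x, y, f):
    # apply f to each input and add the results
    return f(x) + f(y)

def square(x):
    return x * x

with_named = sum_of_anything(3, 4, square)              # 9 + 16
with_lambda = sum_of_anything(3, 4, lambda x: x ** 3)   # 27 + 64
with_builtin = sum_of_anything(-3, 4, abs)              # 3 + 4
print(with_named, with_lambda, with_builtin)
```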

Finally, it's important to note that any name defined in this notebook is defined at the *global* scope.  This means that if you define your own `len` function, you will shadow the built-in `len`.

>**EXERCISE:** Create a dictionary, called `ps_dict`, that contains the primes less than 100 and their corresponding squares.

In [None]:
# your code here


## Part 6. Exceptions, classes and modules 

### Python Exception Handling
Sometimes you make a mistake when writing code.  Rather than crash, the program should fail gracefully, preferably with an informative error message.  These types of considerations are called *exception handling*.  A good way to deal with this in `python` is to use the `try-except` block.  Once again, we won't go deep here.  We'll just show you the basic structure and enthusiastically encourage you to use it when necessary.

Extensive documentation can be found at [Errors and Exceptions](https://docs.python.org/3/tutorial/errors.html).

#### Example
Suppose you write a function that looks like
```python
def bad_func(x: float, y: float) -> float:
    return x / y
```
and you call it with 
```python
bad_func(1, 0)
```
Right away, `python` returns
```
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-10-4a83987fd0cc> in <module>()
      2     return x / y
      3 
----> 4 bad_func(1, 0)

<ipython-input-10-4a83987fd0cc> in bad_func(x, y)
      1 def bad_func(x: float, y: float) -> float:
----> 2     return x / y
      3 
      4 bad_func(1, 0)

ZeroDivisionError: division by zero
```
This is the stack trace and it tells you where the errors occurred (you will see this often in this class!). The ultimate problem is often at the very bottom (division by zero, in this case), and from top down you can see where in the code you were when tragedy struck.

The error was first detected on line 4 at `bad_func(1, 0)` and then at line 2 of `bad_func()`.  Then we're told that the error was a division by zero.

So informative!  Can't ask for much more than that.  In fact, `python` has a whole host of exceptions that it is aware of.  You can find them at [Built-in Exception](https://docs.python.org/3/library/exceptions.html) in the documentation.

Suppose you don't want your program to die when it reaches an exception and want it to automatically fix the problem (or ignore it) and continue on its merry way.  You can use a `try-except` block to handle this.

In [None]:
def bad_func(x: float, y: float) -> float:
    try:
        result = x/y
    except ZeroDivisionError:
        print("WARNING:")
        print("You set y = 0 but y must be non-zero.")
        print("We are setting y = 1.  This may drastically change your results.")
        y = 1.0
        result = x/y
    return result

x, y = 1.0, 0.0
important_quantity = bad_func(x, y)

print("\n Your important_quantity has a value of {0:3.6f}".format(important_quantity))

Notice that our program will continue to run past the point of no return.  We were good developers and warned the user about what was happening.

For example, maybe $y$ is the standard deviation of a quantity.  The code must normalize all variables to the standard deviation.  If the user makes a mistake and accidentally calculates the standard deviation to be zero, then nothing else will work.  So, we just warn them what happened and don't carry out the normalization (by setting the standard deviation to $1$) and carry on with the analysis.  Hopefully it doesn't matter, but if it does, we at least warned the user about it.
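
The normalization scenario above could be sketched as follows. The function name `normalize` and its arguments are hypothetical, chosen only for this illustration, and the sketch uses plain Python floats so that division by zero actually raises `ZeroDivisionError`.

```python
def normalize(values, std):
    """Divide each value by std; if std is zero, warn and skip the normalization."""
    try:
        return [v / std for v in values]
    except ZeroDivisionError:
        print("WARNING: the standard deviation is zero; values are left unnormalized.")
        return list(values)   # same as dividing by a standard deviation of 1

print(normalize([2.0, 4.0], 2.0))
print(normalize([2.0, 4.0], 0.0))
```

Catching only `ZeroDivisionError`, rather than every exception, keeps unrelated mistakes (such as passing a string) visible.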

### Python Classes, Modules, and Libraries

#### Classes/Objects
In true object oriented programming (OOP), the developer writes code around things called objects.  An object (or a class) groups together data and functions that operate on that data.  You might know this terminology from *C++* and other languages.

For example, maybe I have one function that calculates the area of a circle and another function that calculates the perimeter.  I could group these two functions together, along with data about the circle, into a class called `circle`.
A user can create a particular circle by *instantiating* a `circle` object, as we do below.

```python
from shapes import circle
my_circle = circle(r=5)
circle_area = my_circle.area()
circle_perimeter = my_circle.perimeter()
```
When a function is part of an object it is called a *method* instead of a function. Notice that methods are accessed using the *dot* notation.
**What really matters for this class is that we will often create an object and use the methods associated with it without having to worry about the object's internal workings**
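
For the curious, here is a minimal sketch of what such a class might look like. The `shapes` module above is hypothetical, so this is only one possible implementation; the class is named `Circle` here, following the usual convention of capitalizing class names.

```python
import math

class Circle:
    """A circle described by its radius r."""

    def __init__(self, r):
        self.r = r                       # data stored on the instance

    def area(self):
        return math.pi * self.r ** 2

    def perimeter(self):
        return 2 * math.pi * self.r

my_circle = Circle(r=5)                  # instantiate an object
print(my_circle.area(), my_circle.perimeter())
```

The methods `area` and `perimeter` read the radius from `self`, which is how the data and the functions that operate on it stay grouped together.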


#### Modules
Modules in python contain a bunch of code that logically fits together. Most often this is a bunch of classes and functions that address a particular need. For example, there could be a `shapes.py` file that contains the `circle` class used above, as well as a `triangle` class and maybe even a `will_they_fit` function to tell if one shape can fit inside another. That .py file would then be called a module, and we could import from it.

If we only want particular portions of a module, we use the from ___ import ___ syntax above. If we want the whole module, we can do this:
```python
import shapes
my_circle = shapes.circle(r=5)
my_triangle = shapes.triangle(3,4,5)
circle_area = my_circle.area()
fit_flag = shapes.will_they_fit(my_circle, my_triangle)
```

#### Libraries
Libraries may contain a bunch of modules that go together.  A library usually has a specific directory structure.  We won't discuss these further here, because as a user you only need to know about the import syntax above.

## Part 7. Numpy
Scientific Python code uses a fast array structure, called the numpy array. Those who have worked in Matlab will find this very natural. Let's make a numpy array.

In [None]:
my_array = np.array([5, 10, 15, 20])
my_array

Numpy arrays are listy. Below we compute length, slice, and iterate. But iterating over an array element by element in a Python loop is usually a bad idea, for efficiency reasons we will see later on.

In [None]:
print(len(my_array))
print(my_array[1:3])
for elem in my_array:
    print(elem)

In general you should manipulate numpy arrays by using numpy module functions (`np.mean`, for example). You can calculate the mean of the array elements either by calling the method `.mean` on a numpy array or by applying the function `np.mean` with the numpy array as an argument.

In [None]:
print(my_array.mean())
print(np.mean(my_array))

You can generate random variates from a normal distribution with mean 0 and standard deviation 1 by doing:

In [None]:
normal_array = np.random.randn(1000)
print("The sample mean and standard deviation are %f and %f, respectively." %(np.mean(normal_array), np.std(normal_array)))

Numpy supports a concept known as broadcasting, which dictates how arrays of different sizes are combined together. There are too many rules to list here, but importantly, multiplying an array by a number multiplies each element by the number. Adding a number adds the number to each element.
This means that if you wanted a normal distribution with mean 5 and standard deviation 7, written here as $N(5, 7)$, you could do:

In [None]:
normal_5_7 = 5 + 7*normal_array
np.mean(normal_5_7), np.std(normal_5_7)
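
Broadcasting also applies when the two operands are arrays of different shapes. The sketch below combines a scalar, a 2D array, and a 1D array; the names `grid` and `row` are only for this illustration.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])     # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,)

scaled = 2 * grid                # the scalar is applied to every element
shifted = grid + row             # the 1D array is added to each row of grid
print(scaled)
print(shifted)
```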

There are many `numpy` array *constructors*. Here are some commonly used constructors. Look them up in the documentation.

In [None]:
zeros = np.zeros(10) # generates 10 floating point zeros
zeros

In [None]:
ones = np.ones(3)
ones

`Numpy` gains a lot of its efficiency from being strongly typed. That is, all elements in the array have the same type, such as integer or floating point. The default type for these constructors, as can be seen above, is a 64-bit float (`float64`).

In [None]:
zeros.dtype

In [None]:
np.ones(10, dtype='int') # generates 10 integer ones
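
Because every element must share one type, numpy picks a common type when you mix integers and floats. A small sketch, with names chosen only for this illustration:

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])     # one float makes the whole array floating point
as_float = ints.astype(float)     # explicit conversion returns a new array

# the exact integer type (for example int64) can depend on the platform
print(ints.dtype, mixed.dtype, as_float.dtype)
```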

Often you will want random numbers. Use the `random` constructor!

In [None]:
np.random.rand(10) # uniform on [0, 1)

#### `numpy` supports vector operations

What does this mean? It means that to add two arrays instead of looping over each element (e.g. via a list comprehension as in base Python) you get to simply put a plus sign between the two arrays.

In [None]:
ones_array = np.ones(5)
twos_array = 2*np.ones(5)
ones_array + twos_array

Note that this behavior is very different from `python` lists, which just get longer when you try to + them.

In [None]:
first_list = [1., 1., 1., 1., 1.]
second_list = [2., 2., 2., 2., 2.]
first_list + second_list # not what you want

On some computer chips numpy's addition actually happens in parallel, so speedups can be high. But even on regular chips, the advantage of greater readability is important.

Now you have seen how to create and work with simple one dimensional arrays in `numpy`.  You have also been introduced to some important `numpy` functionality (e.g. `mean` and `std`).

Next, we push ahead to two-dimensional arrays and begin to dive into some of the deeper aspects of `numpy`.

### 2D arrays
We can create two-dimensional arrays without too much fuss.

In [None]:
# create a 2d-array by handing a list of lists
my_array2d = np.array([ 
    [1, 2, 3, 4], 
    [5, 6, 7, 8], 
    [9, 10, 11, 12] 
])

# you can do the same without the pretty formatting (decide which style you like better)
my_array2d = np.array([ [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12] ])


# 3 x 4 array of ones
ones_2d = np.ones([3, 4])
print(ones_2d, "\n")

# 3 x 4 array of ones with random noise
ones_noise = ones_2d + 0.01*np.random.randn(3, 4)
print(ones_noise, "\n")

# 3 x 3 identity matrix
my_identity = np.eye(3)
print(my_identity, "\n")

Like lists, `numpy` arrays are $0$-indexed.  Thus we can access the $n$th row and the $m$th column of a two-dimensional array with the indices $[n - 1, m - 1]$.

In [None]:
print(my_array2d)
print("element [2,3] is:", my_array2d[2, 3])

Numpy arrays can be sliced, and can be iterated over with loops.  
 
Notice that the list slicing syntax still works!  
`array[2:,3]` says "in the array, get rows 2 through the end, column 3"  
`array[3,:]` says "in the array, get row 3, all columns".
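
Here are the two slices above applied to a concrete array. The $4 \times 4$ array built with `np.arange` is only an illustration, chosen so that row 3 and column 3 both exist.

```python
import numpy as np

array = np.arange(16).reshape(4, 4)   # 4 x 4 array holding 0..15, row by row
col_tail = array[2:, 3]               # rows 2 through the end, column 3
row_three = array[3, :]               # row 3, all columns

print(array)
print(col_tail)    # [11 15]
print(row_three)   # [12 13 14 15]
```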

Numpy functions will by default work on the entire array:

In [None]:
np.sum(ones_2d)

The axis `0` is the one going downwards (i.e. the rows), whereas axis `1` is the one going across (the columns). You will often use functions such as `mean` or `sum` along a particular axis. If you `sum` along axis 0 you are summing across the rows and will end up with one value per column. As a rule, any axis you list in the axis argument will disappear.

In [None]:
np.sum(ones_2d, axis=0)

In [None]:
np.sum(ones_2d, axis=1)
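
With an array of ones it is hard to tell the two results apart, so here is the same idea on an array with distinct entries. The name `table` is only for this illustration.

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])        # shape (2, 3)

col_sums = np.sum(table, axis=0)     # axis 0 disappears: one value per column
row_sums = np.sum(table, axis=1)     # axis 1 disappears: one value per row

print(col_sums, col_sums.shape)      # [5 7 9] (3,)
print(row_sums, row_sums.shape)      # [ 6 15] (2,)
```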

<div class="exercise"><b>Exercise</b></div>
* Create a two-dimensional array of size $3\times 5$ and do the following:
  * Print out the array
  * Print out the shape of the array
  * Create two slices of the array:
    1. The first slice should be the last row and the third through last column
    2. The second slice should be rows $1-3$ and columns $3-5$
  * Square each element in the array and print the result

In [None]:
# your code here
A = np.array([ [5, 4, 3, 2, 1], [1, 2, 3, 4, 5], [1.1, 2.2, 3.3, 4.4, 5.5] ])
print(A, "\n")

# shape of the array
dims = A.shape
print(dims, "\n")

# slicing
print(A[-1, 2:], "\n")
print(A[1:3, 3:5], "\n")

# squaring
A2 = A * A
print(A2)

#### `numpy` supports matrix operations
2D arrays are numpy's way of representing matrices. As such, there are lots of built-in methods for manipulating them.

Earlier when we generated the one-dimensional arrays of ones and random numbers, we gave `ones` and `random`  the number of elements we wanted in the arrays. In two dimensions, we need to provide the shape of the array, i.e., the number of rows and columns of the array.

In [None]:
three_by_four = np.ones([3,4])
three_by_four

You can transpose the array:

In [None]:
three_by_four.shape

In [None]:
four_by_three = three_by_four.T

In [None]:
four_by_three.shape

Matrix multiplication is accomplished by `np.dot`. The `*` operator will do element-wise multiplication.

In [None]:
print(np.dot(three_by_four, four_by_three)) # 3 x 3 matrix
np.dot(four_by_three, three_by_four) # 4 x 4 matrix
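
Because the arrays above contain only ones, the difference between `np.dot` and `*` is easy to miss. The sketch below uses two small made-up matrices, `a` and `b`, to show both side by side. It also uses the `@` operator, which is another way to write matrix multiplication in Python 3.5 and later.

```python
import numpy as np

# Two hypothetical 2 x 2 matrices
a = np.array([[1, 2], [3, 4]])
b = np.array([[0, 1], [1, 0]])

elementwise = a * b      # multiplies matching entries
matmul = np.dot(a, b)    # rows of a combined with columns of b
with_at = a @ b          # same result as np.dot for 2D arrays

print(elementwise)
print(matmul)
```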

Numpy has functions to do the difficult matrix operations that are awful to do by hand.

In [None]:
matrix = np.random.rand(4,4) # a 4 by 4 matrix
matrix

Let's get the eigenvalues and eigenvectors!

In [None]:
np.linalg.eig(matrix)

How about inverses?

In [None]:
inv_matrix = np.linalg.inv(matrix) # the inverse matrix
print(inv_matrix)

# check that it's the inverse: the product should be (close to) the identity
np.dot(matrix,inv_matrix)

Notice that there is a bit of 'rounding error' in the inverse calculation. This is because the computer can't store the exact values the inverse matrix asks for (they have more decimal places than the computer can hold). Built-in numpy routines manage these errors, which is why it's very important to use pre-built tools whenever possible, and to be very cautious when writing your own.

(It happens that there are even more advanced numpy functions like `np.linalg.solve` which are more accurate than just taking the naked inverse)
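
As a rough sketch of how `np.linalg.solve` is used, the example below solves a small linear system $Ax = b$ both ways. The matrix `A` and vector `b` are made up for illustration; on a tiny, well-behaved system like this one the two answers agree, and the difference in accuracy only shows up on harder problems.

```python
import numpy as np

# Hypothetical system: 3x + y = 9 and x + 2y = 8
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])

x_solve = np.linalg.solve(A, b)        # solves A x = b directly
x_inv = np.dot(np.linalg.inv(A), b)    # forms the inverse first, then multiplies

print(x_solve)
print(np.allclose(x_solve, x_inv))     # compare up to rounding error
```

Using `np.allclose` rather than `==` is the usual way to compare floating point results, since tiny rounding differences are expected.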

See the documentation to learn more about `numpy` functions as needed.

## Part 8:  I/O and Preprocessing
Much of data science and computational science involves reading data from files and writing to files.   This process is generally known as `I/O`.

There are many ways of accomplishing different `I/O` tasks.  `Python` has its own built-in functionality for reading and working with files.  You can also read data with `numpy` and `pandas` among others.  We won't cover `numpy` for basic input parsing today, but you'll probably be introduced to it eventually.  We will spend a considerable amount of time on `pandas`.

### Part 8.1:  `Python`'s built-in I/O 
We'll work with the small file called `brief_comments.txt` in the `data` directory.

You can read in a file using the `open` function.  There is a "right" way and a "wrong" way of doing this.

The first cell below opens and closes the file by hand.  The second cell reads the same file using the `with` statement, which is the preferred approach.

In [None]:
# This approach should not be used!
f = open("data/brief_comments.txt", "r") # Open the file for reading
dogs = f.read() # Read the file
f.close() # Remember to close the file!

In [None]:
# This approach is the correct way, and should always be used.
with open("data/brief_comments.txt", "r") as f:
    dogs = f.read()

#### Observations
The `with` statement does a few things for us automatically.  First, it closes the file for us so we don't need to remember to do this.  It is important to close a file when you're done with it!  There are a few reasons for this:
1. Having too many files open at once consumes resources
2. Not closing a file is sloppy coding
3. You might not see changes to the file until you close it

`with` even closes the file for us if an exception was thrown.  It's nice that the `with` statement handles that for us.
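
To see this behavior in action, here is a minimal sketch. It writes a throwaway file to a temporary directory (the file name `scratch.txt` and its contents are made up, so that the example does not depend on the lab's data files) and then raises an exception on purpose inside the `with` block.

```python
import os
import tempfile

# Create a small scratch file to read from
path = os.path.join(tempfile.mkdtemp(), "scratch.txt")
with open(path, "w") as f:
    f.write("dogs are great\n")

try:
    with open(path, "r") as f:
        first_line = f.readline()
        # Simulate something going wrong while the file is still open
        raise ValueError("something went wrong while processing")
except ValueError:
    pass

print(f.closed)  # the file object reports whether it has been closed
```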

### Part 8.2:  Preprocessing

Now we can do some operations on the text.  We will explore a few methods here:
1. `len`
2. `split`
3. `lower`

There are many other methods as part of the `string` class.  These can be found in the documentation:  [String Methods](https://docs.python.org/3/library/stdtypes.html#string-methods).

In [None]:
print(dogs) # What are the contents of the object we just read in?

In [None]:
type(dogs) # What kind of data are we dealing with?

In [None]:
l = len(dogs) # How many characters are in this string?
print(l)

In [None]:
dogs[10] # Let's access the 11th item

That's something of a letdown.  The `string` object is just one giant string, so accessing the 11th item gives us the 11th character.  It would be more useful to access individual words.  We can use the `split` method to accomplish this: [`str.split`](https://docs.python.org/3/library/stdtypes.html#str.split).

In [None]:
words = dogs.split()
print(words)

In [None]:
type(words)

So `split` returned a `python` `list` by splitting the string into elements separated by white space.  Now let's see what the 11th item is.

In [None]:
words[10]

Very nice!  Let's explore some of the other cool string operations that we can do.

In [None]:
N = len(words) # Number of words
print("There are {0} words in our brief comments.".format(N))

#### Brief Interlude
We used the `format` method on a string.  The *pythonic* way of doing this in `python3` can be found at the following resources:
* [The `format` statement](https://docs.python.org/3/library/stdtypes.html#str.format) --- Syntax for using the `format` statement.
* [Format String Syntax](https://docs.python.org/3/library/string.html#formatstrings) --- Different ways for formatting strings (e.g. printing integers, floats, etc).
* [Formatting Literal Strings](https://docs.python.org/3/reference/lexical_analysis.html#f-strings) --- An alternative way of formatting strings.  Not used in this lab.
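
Here are a few small formatting examples to go with those links. The variable names and values are made up for illustration, and the last line shows the f-string alternative only for comparison.

```python
n_words = 42          # hypothetical values, for illustration only
avg_len = 4.56789

s1 = "There are {0} words.".format(n_words)
s2 = "Average length: {0:.2f} characters.".format(avg_len)   # two decimal places
s3 = "{count} words, {avg:.1f} characters each".format(count=n_words, avg=avg_len)
s4 = f"There are {n_words} words."   # f-string (Python 3.6+), same text as s1

for s in (s1, s2, s3, s4):
    print(s)
```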

#### End Brief Interlude

Let's get back to our nice little example.

Suppose we want to count the occurrences of a particular word.  We could use the `count` method (see [`list.count()`](https://docs.python.org/3/tutorial/datastructures.html)).

How many times is the word `dogs` mentioned?

In [None]:
words.count("dogs")

That's not correct.  We can see clearly that the word `dogs` is mentioned $3$ times (it's a small enough text that we can count this manually).  The problem is that sometimes `dogs` is capitalized, sometimes it comes with a period, and sometimes it's all lowercase.  We need to do further processing.

In [None]:
more_words = [word.split('.')[0] for word in words] # List comprehension
more_words.count("dogs")

We found $2$ dogs!  Still not correct, but better than before.

What just happened here?!

In [None]:
# We can write the list comprehension as a for loop as follows:
more_words1 = []
for word in words:
    inter = word.split('.')
    inter1 = inter[0]
    more_words1.append(inter1)
more_words1.count("dogs")

Let's put this all into English.


1. First of all, this time we used `split` to split on periods rather than white space.  Splitting on white space is the default.  Beyond that, you have to tell it what to do the split on.
2. Second, we introduced a **list comprehension**.  The list comprehension structure is extremely useful.  You should take a look at the documentation:  [List Comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).  Basically, a list comprehension can be used to create a new list by applying operations on an old list.
3. Third, we accessed the first element of the new list in the list comprehension.
  * You see, each time the `split` method is called, it creates a new list.
  * But we know that we don't want a list of lists.
    - Try it out.  Just print the result from `[word.split('.') for word in words]` and see what it looks like.
  * We also know that the first element of each nested list is the word itself (a word that ended in a period also leaves behind an empty string as a second element), so we just access that first element and get a string back.  If this sentence sounds strange to you, then you probably didn't try things out on your own like we suggested you do.
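
If you'd like to see those steps on something tiny first, here is a sketch using a made-up sentence (`sample` is not part of the lab's data). It shows the list of lists that `split('.')` produces and what is left after taking element 0 of each.

```python
sample = "I like dogs. Dogs like me."   # hypothetical sentence
sample_words = sample.split()           # split on white space

nested = [w.split('.') for w in sample_words]      # a list of lists
flat = [w.split('.')[0] for w in sample_words]     # first element of each inner list

print(nested)
print(flat)
print(flat.count("dogs"))   # the capitalized "Dogs" is still not counted
```

Notice that a word ending in a period, such as `'dogs.'`, splits into the word plus an empty string.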

Still, we have more work to do.  Only $2$ occurrences of the word `dogs` were noted.  The other occurrence happens with a capital letter.  Not to fear!  We can convert strings to lower case using the `lower` method.

In [None]:
my_str = 'HELLO Bonnie'
my_str.lower()

<div class="exercise"><b>Exercise</b></div>
Use a list comprehension to create a list of all lower case words starting from the `more_words` list that we just created.  Then print out the number of occurrences of the word `dogs`.

In [None]:
# Your code here


<div class="exercise"><b> Exercise </b> </div>
* `hamlet.txt` is in the `data` directory.  Open and read it into a variable called `hamlettext`.
* What is the type of `hamlettext`?  What is its length?  Print the first $500$ items of `hamlettext`.
* Create a list called `hamletwords` where the items are the words of the play.
  * Confirm that the list you created is really a list
  * Confirm that each element of the list is a string
  * Print the first 10 items in the list.  
  * Print "There are $N$ total words in Hamlet.",  where $N$ is the total number of words in Hamlet.
* Using a *list comprehension*, create `hamletwords_lc` which converts the items in `hamletwords` to lower-case. 
* Count the number of occurrences of the word "thou".
* Use `set` to determine the set of unique words in `hamletwords_lc`.  Here's documentation on the `set` datatype:  [Sets](https://docs.python.org/3/tutorial/datastructures.html#sets).
  * Print "There are $M$ unique words in Hamlet.", where $M$ is the number of unique words.  As a sanity check, verify that $M < N$.
  * Your output should be 
  ```
  "There are 7456 unique words in Hamlet."
  ```

In [None]:
# Your code here



### Part 8.3:  Writing Files
So far, we've discussed how to read data from files.  We've used `python`'s built-in functionality.  Of course, you generally want to *write* data to files as well.  What's the point of generating data if you're not saving it somewhere?!

We'll begin this section by generating some data to write.

In [None]:
my_ints = [i for i in range(-5, 6)]
my_ints2 = [i*i for i in my_ints]
print("Our list is {0}.".format(my_ints2))

Now let's prepare to write this data out.

In [None]:
with open("data/datafile.txt", "w") as dataf:
    # header
    dataf.write("Here is a list of squared ints.\n\n")
    # Columns
    dataf.write("n")
    dataf.write(", ")
    dataf.write("n^2" + "\n")
    # Data
    for i, i2 in zip(my_ints, my_ints2):
        dataf.write("{}, {}\n".format(str(i), str(i2)))

Once again, there are a few things worth mentioning here.

1. This is pretty ugly. The `write` method only accepts strings, so each line had to be built as a string before writing it out. (The explicit `str` calls are not strictly necessary here, since `format` already converts the ints for us.)
2. We've introduced the `zip` function.
  * Here's the documentation: [`zip` documentation](https://docs.python.org/3/library/functions.html#zip)
  * Here's the essence:  Pair up the elements of the two lists into tuples.  Now you can iterate over the pairs.
  * More precisely, `zip` aggregates the *iterables* `my_ints` and `my_ints2` into an *iterator* of tuples.  
    - In our case, the iterables here are just lists (they can be iterated on).
    - The *iterator of tuples* is formed by pairing the first elements of the lists into one tuple, the second elements into the next tuple, and so on.
    - The `for` statement steps through these tuples one at a time, unpacking each into `i` and `i2`.
  * This sounds complicated, but it makes things very clean and nice to work with.  You should practice using `zip` whenever possible.
3. The related cousin of `zip` is `enumerate`:  [`enumerate` documentation](https://docs.python.org/3/library/functions.html#enumerate).  We will use `enumerate` all the time, but not yet.
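
Here is a small sketch of `zip` and `enumerate` side by side. The lists `names` and `ages` are made up for the example.

```python
names = ["Cloe", "Karl", "Bonnie"]   # hypothetical example lists
ages = [3, 7, 5]

# zip pairs up elements position by position
pairs = list(zip(names, ages))
for name, age in zip(names, ages):
    print("{0} is {1} years old.".format(name, age))

# enumerate pairs each element with its index
numbered = list(enumerate(names))
for idx, name in enumerate(names):
    print(idx, name)

# zip stops as soon as the shortest iterable runs out
short = list(zip([1, 2, 3], "ab"))
```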

### `json`

We really don't want to write out complex data forms using the `write` method.

Fortunately, there are a bunch of ways to write out data:
* [`pickle`](https://docs.python.org/3/library/pickle.html#module-pickle) --- `python` specific; used for saving and loading `python` objects
* [`xml`](https://docs.python.org/3/library/xml.html) --- Commonly used standard for storing data.
* [`json`](https://docs.python.org/3/library/json.html#module-json) --- Extremely commonly used standard for sharing data.

We will focus on `json` because it is a commonly used standard for data exchange.  Here is a nice little tutorial on `json`: [Working With JSON Data in Python](https://realpython.com/python-json/).

Here are a few comments on `json`:
* Human readable
* Used both within and external to the `python` ecosystem
* Cannot represent all `python` types, but that's usually okay.  If you are only working in `python`, then you should probably just use `pickle`.
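
To make the last point concrete, here is a short sketch of what happens to a few `python` types on a round trip through `json`. The dictionary `record` is made up for this illustration.

```python
import json

# Hypothetical record containing a tuple, an integer key, and None
record = {"coords": (1, 2), 7: "seven", "tags": None}

text = json.dumps(record)      # python object -> json string
restored = json.loads(text)    # json string -> python object

print(text)
print(restored)   # the tuple came back as a list, the key 7 came back as "7"

# Some types cannot be written at all, for example sets
try:
    json.dumps({1, 2, 3})
    set_ok = True
except TypeError:
    set_ok = False
print(set_ok)
```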

In [None]:
import json # import the json library

Suppose we have a `python` dictionary containing information about individual dogs in a particular dog shelter.

In [None]:
dog_shelter = {} # Initialize dictionary

# Set up dictionary elements
dog_shelter['dog1'] = {'name': 'Cloe', 'age': 3, 'breed': 'Border Collie', 'playgroup': 'Yes'}
dog_shelter['dog2'] = {'name': 'Karl', 'age': 7, 'breed': 'Beagle', 'playgroup': 'Yes'}

dog_shelter

We can access the elements of the dictionary like so:

In [None]:
dog_shelter['dog1']

In [None]:
dog_shelter['dog2']['name']

#### Writing to `json` file

Now we should save the dictionary to a file.  We decide to save it in `json` format because then many people will be able to read it and work with it.

In [None]:
with open('dog_shelter_info.txt', 'w') as output:  
    json.dump(dog_shelter, output)

Make sure the file is there in the same folder as this script. 

#### Reading from `json` file

Reading from a `json` file is also very easy.

In [None]:
with open('dog_shelter_info.txt', 'r') as f:
    dog_data = json.load(f)

In [None]:
print(dog_data)

Let's explore the data structure.  We know that it's a `python` dictionary.

In [None]:
for dogid, info in dog_data.items():
    print(dogid)
    print("{0} is a {1} year old {2}.".format(info['name'], info['age'], info['breed']))
    if info['playgroup'].lower() == 'yes':
        print("{0} can attend playgroup.".format(info['name']))
    else:
        print("{0} is not permitted at playgroup.".format(info['name']))
    print("======================================\n")

##  Part 9: Regular Expressions

### Background and Motivation
*Regular Expressions* (a.k.a. `regex` or `regexp`) are a tool for working with and manipulating text data.  We've already done some text manipulation in this lab.  We've shied away from particularly thorny examples until now.  Using `python`'s string methods is useful, but that approach has its limitations.

Regular expressions provide a set of rules for working with text data.  At first, these expressions look completely foreign (e.g. `([0-9]+(\.[0-9]+){3})`), but once you know some of the basics they're not so bad.

As it turns out, the fundamentals of regular expressions are based upon abstract algebra.  Mathematicians have studied regular expressions simply to lay down and understand their theoretical underpinnings.  We won't go anywhere near that level of detail.  For us, regular expressions will simply be used to process some gnarly text data.

There are a few key `regex` patterns and concepts that you must know and be comfortable with.  The fact is, there are many ways to create a `regex` to search for a particular pattern.  Some approaches are more succinct than others.  As with most things, you will get better the more you practice.  You should try to make your `regex` patterns as crisp as possible while still maintaining readability.

### Some resources
In order to become proficient with `regex`s, you are **strongly encouraged** to take the *RegexOne* tutorial at [https://regexone.com/](https://regexone.com/).  That tutorial is an interactive and accessible introduction to regular expressions and it can be done in an hour.  It contains problems at the end to test your knowledge.  The *RegexOne* website also contains a very nice demo for `Python3`.  This lab will borrow from the *RegexOne* `python` demo to walk you through some concepts.

You may also want to consider the book [Mastering Regular Expressions](http://shop.oreilly.com/product/9780596528126.do) for more details as well as some historical comments.

---

### Learning by Example
Suppose you have a string containing a date:

In [None]:
birthday = "June 11"

You would like to search this string for the month.  For such a simple string, this can easily be done with the `python` string methods.

In [None]:
birth_month = birthday.strip()[:-3]
print(birth_month)

We're after much more intense strings, which we'll process with regular expressions.  Let's warm up with a `regex` on this simple string.

In [None]:
regex = r"\w+" # A first regular expression

What in the world does this mean?!  Well, there are a few syntactical details here:
1. The `r` means that the string is a *raw string*.  This tells `python` not to treat backslashes as escape characters, so the pattern reaches the regular expression engine exactly as typed.  Raw strings are useful for the same reason when writing TeX, which is also full of backslashes.
2. The `\w` indicates any alphanumeric character (a letter, a digit, or an underscore).
3. The `+` indicates one or more occurrences.

In English words, we say that `regex` is a regular expression that tries to match one or more occurrences of alphanumeric characters.
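
A quick way to convince yourself of what the `r` prefix does is to compare string lengths. This is only a small sketch; the variable names are made up.

```python
plain = "\n"    # one character: a newline
raw = r"\n"     # two characters: a backslash followed by the letter n

print(len(plain), len(raw))

# Without the r prefix, the backslash itself has to be escaped
same_pattern = ("\\w+" == r"\w+")
print(same_pattern)
```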

We still haven't specified what string we want to find the matches in.  All we've done so far is specify a `regex`.

Let's remedy that.  We will now use the `python` `re` module to start matching some regular expressions in strings.  Here are two more resources for you:
* [`re` module documentation](https://docs.python.org/3/library/re.html) --- The official `python` documentation on the `re` module
* [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html#regex-howto) --- A gentler introduction to using the `re` module.

Honestly, your best bet is still to start with the resources found on the *RegexOne* site.

In [None]:
import re # Regular expression module
months = re.search(regex, birthday) # Search string for regex
print(months)

We just searched the `birthday` string for the regular expression contained in `regex`.  If the pattern doesn't match, then we get `None` in return, otherwise we get an object that contains some information.  In our case, the pattern matches something in `birthday`.  What information did we get?
* We are told the `start` and `end` of the matching pattern (that is the `span=(0, 4)`)
* We are told what matched (that is the `'June'`)

You can access the starting and ending indices with the `start()` and `end()` methods as follows:

In [None]:
print("The matched pattern starts at index {0} and ends at index {1}.".format(months.start(), months.end()))
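
The match object has a few more convenient methods. As a small sketch (the variable names here are chosen just for this example), `group()` returns the matched text and `span()` returns the start and end indices as a tuple.

```python
import re

text = "June 11"
m = re.search(r"\w+", text)

matched = m.group()                   # the text that matched
bounds = m.span()                     # (start, end) as a tuple
sliced = text[m.start():m.end()]      # slicing with the indices gives the same text

print(matched, bounds, sliced)
```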

Note that we could have used a very simple pattern to search for the word `June`:

In [None]:
regex = r"June"
re.search(regex, birthday)

Same answer!

In [None]:
re.search(r"Oct", birthday) # nothing prints out

**Note:** When a regex fails to match, like above, it can look a little weird. Instead of getting an empty regex object that prints out, we get `None`, which doesn't display anything. Printing it still works though.

In [None]:
months = re.search(r"Oct", birthday)
print(months) # printing the match object shows us the result, even if no match was found.
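
Because a failed search returns `None`, calling a method such as `.end()` on the result raises an error when nothing matched. One common way to guard against this is to check for `None` first. The helper `first_match` below is a hypothetical function written for this sketch; it is not part of the `re` module.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None if the pattern is not found."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return m.group()

found = first_match(r"June", "June 11")
missing = first_match(r"Oct", "June 11")
print(found, missing)
```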

As already mentioned, regular expressions work directly with text.  You need the fancier stuff when you have more complicated strings.  We'll get to that in a moment.  First, do the following exercise.

<div class=exercise><b>Exercise</b></div>
Consider the string 
```python
statement = "June is a lovely month."
```
* Use a regular expression to find the pattern `June`.
* Create a new string, `fragment` from `statement`, which starts just after the word `June`.

Your output should be ` is a lovely month.`

In [None]:
statement = "June is a lovely month."
regex = r"June"
fragment = statement[re.search(regex, statement).end():]
print(fragment)

Okay, we're ready to move on to more interesting things.  We'll do this in a sequence of demos.

First, let's try to get the day out of the birthday string.  We'll use some more interesting expressions to illustrate some of the important patterns.

#### We can use `\d` to get just digits.

In [None]:
regex = r"\d+"
re.search(regex, birthday)

#### We can use `[a-z]` for characters `a` to `z` and `[0-9]` for digits `0` to `9`.

In [None]:
regex = r"[A-Za-z]+"
re.search(regex, birthday)

Note that we had to specify both capital letters and lowercase letters.  We also needed the `+` pattern to make sure that one or more occurrences of the characters were found.  If not, we would have only gotten one occurrence as illustrated in the next example.

In [None]:
regex = r"[0-9]"
re.search(regex, birthday)

Only got the first occurrence of `1`!

#### `findall()`

Let's start getting down to business.  We want the actual month and the actual day.  Not the whole thing.  That's not too hard given what we already have at our disposal.

In [None]:
regex_month = r"[A-Za-z]+"
month = re.findall(regex_month, birthday)
print(month)

regex_day = r"\d+"
day = re.findall(regex_day, birthday)
print(day)

The `findall()` method returns a list of all the pattern matches.  Very cool.  Now we're ready to move on to another very important concept: *groups*.
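
To contrast `search` and `findall` directly, here is a small sketch on a made-up string with two dates in it.

```python
import re

text = "June 11, May 12"   # hypothetical string with two dates

first = re.search(r"\d+", text).group()          # only the first match
every = re.findall(r"\d+", text)                 # all matches, as a list of strings
nothing = re.findall(r"\d+", "no digits here")   # no match gives an empty list, not None

print(first, every, nothing)
```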

#### Groups
Let's say we have a busy string of birthdays:

In [None]:
birthdays = "June 11th, December 13th, September 21st, May 12th"

We want to get all the months and all the days.  This looks like a job for the `findall()` method.

In [None]:
regex = r"[A-Za-z]+"
bdays = re.findall(regex, birthdays)
print(bdays)

That's not right.  Almost, but not quite.  We can fix things in a bunch of ways.  Let's take this opportunity to introduce groups.

In [None]:
regex = r"([A-Za-z]+) (\d+\w+)"
bdays = re.findall(regex, birthdays)
print(bdays)

Let's try to unpack all of that:
* The parentheses indicate a group.  So, our first set of parentheses indicates that we want to capture a pattern of one or more letters.
* Right after that first group, we have a space.
* Then we have another group.  This time, the group indicates a pattern with one or more occurrences of digits followed by one or more occurrences of any alphanumeric characters.

We could have accomplished the same thing in a number of ways.  Here are a couple more possibilities:
```python
regex = r"([A-Za-z]+)\s(\d+\w+)"
regex = r"([A-Za-z]+)\s(\w+)"
regex = r"([A-Za-z]+) (\d+[a-z]+)"
```
You get the idea.
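
As a quick check, the sketch below runs all of these patterns on the `birthdays` string (repeated here so the block is self-contained) and confirms that they return the same list. It also shows that, with groups, `findall` returns one tuple per match, which unpacks nicely in a `for` loop.

```python
import re

birthdays = "June 11th, December 13th, September 21st, May 12th"

patterns = [
    r"([A-Za-z]+) (\d+\w+)",
    r"([A-Za-z]+)\s(\d+\w+)",
    r"([A-Za-z]+)\s(\w+)",
    r"([A-Za-z]+) (\d+[a-z]+)",
]

results = [re.findall(p, birthdays) for p in patterns]
all_same = all(r == results[0] for r in results)
print(all_same)

# Each match is a (month, day) tuple
for month, day in results[0]:
    print(month, day)
```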

It's also possible to get each month and day together as a single string, leaving off the ordinal suffix.

In [None]:
regex = r"[A-Za-z]+ \d+"
bdays = re.findall(regex, birthdays)
for bday in bdays:
    print(bday)

There are many other ways to play with these `regex` patterns.  Let's do an exercise.

<div class=exercise><b>Exercise</b></div>
* Open and read the file `shelterdogs.xml` into a string named `dogs`.  It should look like:

```
<?xml version="1.0" encoding="UTF-8"?>

<dogshelter>
    <dog id="dog1">
        <name> Cloe </name>
        <age> 3 </age>
        <breed> Border Collie </breed>
        <playgroup> Yes </playgroup>
    </dog>
    <dog id="dog2"> 
        <name> Karl </name> 
        <age> 7 </age>
        <breed> Beagle </breed>
        <playgroup> Yes </playgroup>
    </dog>
</dogshelter>
```
* Write a regular expression to match the dog names.  That is, you want to match the name inside the name tag: `<name> dog_name </name>`.
  * **Hint:** Use a group.
* Print out each name.

Your output should be 
```python
Cloe
Karl
```

In [None]:
# your code here


<div class=exercise><b>Exercise</b></div>
Although you successfully completed the previous exercise, you think it would have been nicer to strip out the first two lines of the `dogs` string.  Do that now.

**Hints:**
* The first line has some special metacharacters in it (e.g. ?, ", \n).  You can escape these by using a backslash. For example, \? treats ? like a real question mark.  Otherwise it's the *optional* character in regular expressions.
* Consider using [\n]+ to deal with the end of line character.

Your output should be:
```
<dogshelter>
    <dog id="dog1">
        <name> Cloe </name>
        <age> 3 </age>
        <breed> Border Collie </breed>
        <playgroup> Yes </playgroup>
    </dog>
    <dog id="dog2"> 
        <name> Karl </name> 
        <age> 7 </age>
        <breed> Beagle </breed>
        <playgroup> Yes </playgroup>
    </dog>
</dogshelter>
```

In [None]:
# your code here


This ends the `I/O` introduction.  We've discussed the following:
* How to read and write data using straight `python`.
* How to process text data using `python`'s built-in string methods.
* How to read and write `JSON` files.
* How to use regular expressions to process text data.

All of this was fine for the examples that we've done so far.  However, we are interested in working with very complicated text strings (possibly from log files and websites) and messy data.  Regular expressions will still be useful, but there are other tools available to make our lives easier.

First, we will introduce the `python` library `pandas` for working with complicated data types.  After that, we'll introduce the *BeautifulSoup* `python` library for reading and parsing data from websites.

## Part 10:  Introduction to Pandas

We'd like a data structure that can easily store variables of different types, that stores column names, and that we can reference by column name as well as by indexed position.  And it would be nice if this data structure came with built-in functions that we can use to manipulate it. 

`Pandas` is a package/library that does all of this!  The library is built on top of `numpy`.  

There are two basic `pandas` objects, *series* and *dataframes*, which can be thought of as enhanced versions of 1D and 2D `numpy` arrays, respectively.  

For reference, here is a useful `pandas` [cheatsheet](https://drive.google.com/folderview?id=0ByIrJAE4KMTtaGhRcXkxNHhmY2M&usp=sharing) and the `pandas` [documentation](https://pandas.pydata.org/pandas-docs/stable/).

In [None]:
import pandas as pd

### Importing data

Now let's read in some automobile data as a pandas *dataframe* structure.  

In [None]:
# Read in the csv files
dfcars=pd.read_csv("data/mtcars.csv")

# Display the header and the first five rows of data
dfcars.head()

Wow!  That was easy and the output looks very nice.  What we have now is a spreadsheet with indexed rows and named columns, called a *dataframe* in pandas.  `dfcars` is an instance of the `pd.DataFrame` class, created by calling the `pd.read_csv` function, which then calls the DataFrame constructor inside of it. If the last sentence is confusing, don't worry, it will become clearer later.  The take-away is that `dfcars` is a dataframe object, and it has methods (functions) belonging to it. For example, `dfcars.head()` is a method that shows the first 5 rows of the dataframe.

A pandas dataframe is a set of columns pasted together into a spreadsheet, as shown in the schematic below.  The columns in `pandas` are called *series* objects.

![](images/pandastruct.png)

Initial data exploration is as simple as a one-liner.

In [None]:
dfcars.describe()

That's about as simple as you could ever ask for.

Returning to the `dfcars` dataframe, we notice that the first column has a bad name: "Unnamed: 0". Let's **clean** it up. 

In [None]:
dfcars=dfcars.rename(columns={"Unnamed: 0":"car name"})
dfcars.head()

### Dataframes and Series

Now that we have our automobile data loaded as a dataframe, we'd like to be able to manipulate it and its series, say by calculating statistics and plotting distributions of features.  Fortunately, like arrays and other containers, dataframes and series are listy, so we can apply the list operations we already know to these new containers.  Below we explore our dataframe and its properties.

#### set length

The attribute `shape` tells us the dimensions of the dataframe as `(rows, columns)`.  The `len` function returns the number of rows in the dataframe, not the number of columns as one might expect from the way dataframes are built up from pandas series (columns).  This may seem strange at first, but it is useful in practice: the `len` of a dataframe matches the length of each of its series.  

In [None]:
print(dfcars.shape)     # 12 columns, each of length 32
print(len(dfcars))      # the number of rows in the dataframe, also the length of a series
print(len(dfcars.mpg))  # the length of a series
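
The same relationships can be checked on a tiny dataframe built by hand. The frame `toy` below is made up for this sketch and is unrelated to the `mtcars` file.

```python
import pandas as pd

# A hypothetical 3-row, 2-column dataframe
toy = pd.DataFrame({"mpg": [21.0, 22.8, 18.7], "cyl": [6, 4, 8]})

n_rows, n_cols = toy.shape
print(toy.shape)          # (rows, columns)
print(len(toy))           # number of rows
print(len(toy.columns))   # number of columns
print(len(toy.mpg))       # length of one series, same as the number of rows
```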

#### iteration via loops

 One consequence of the column-wise construction of dataframes is that you cannot easily iterate over the rows of the dataframe.  Instead, we iterate over the columns, for example, by printing out the column names via a for loop.

In [None]:
for ele in dfcars: # iterating iterates over column names though, like a dictionary
    print(ele)

Or we can call the attribute `columns`.  Notice the `Index` in the output below. We'll return to this shortly. 

In [None]:
dfcars.columns

We can iterate series in the same way that we iterate lists. Here we print out the number of cylinders for each of the 32 vehicles.  However, you shouldn't do this in general.  Try to use the built-in `pandas` methods.

In [None]:
for ele in dfcars.cyl:
    print(ele)

How do you iterate over rows?  Dataframes are put together column-by-column and you should be able to write code which never requires iteration over rows. But if you still find a need to iterate over rows, you can do it using `itertuples`.  See the documentation.  

**In general direct iteration through pandas series/dataframes is a bad idea.**

Instead, you should manipulate dataframes and series with `pandas` methods which are written to be very fast (i.e. they access series and dataframes at the `C` level).
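
As a rough sketch of the difference, the example below computes the same quantity twice on a made-up dataframe `toy`: once by looping over rows with `itertuples`, and once with a single column-wise operation. The column names are chosen only for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "car": ["a", "b", "c"],        # hypothetical data
    "mpg": [21.0, 22.8, 18.7],
    "cyl": [6, 4, 8],
})

# Row-by-row: each row arrives as a named tuple
looped = [row.mpg / row.cyl for row in toy.itertuples(index=False)]

# Column-wise: one operation on whole series, no explicit loop
vectorized = toy["mpg"] / toy["cyl"]

print(looped)
print(vectorized.tolist())
```

Both give the same numbers, but the column-wise version is shorter and, on large dataframes, typically much faster.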

#### slicing

Let's see how indexing works in dataframes.  Like lists in `python`, dataframes and series are zero-indexed.

In [None]:
dfcars.head()

In [None]:
# index for the dataframe
print(list(dfcars.index))

# index for the cyl series
dfcars.cyl.index

There are two ways to index dataframes:
1. the `loc` property indexes by label name
2. the `iloc` indexes by position in the index.

We'll illustrate this with a slightly modified version of `dfcars`, created by relabeling the row indices of `dfcars` to start at $5$ instead of $0$.

In [None]:
# create values from 5 to 36
new_index = [i+5 for i in range(32)]

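# new dataframe with the rows relabeled 5 to 36
# (assigning to .index relabels the rows; reindex would instead look up existing labels and fill missing ones with NaN)
dfcars_reindex = dfcars.copy()
dfcars_reindex.index = new_index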
dfcars_reindex.head()

We now return the first three rows of `dfcars_reindex` in two different ways, first with `iloc` and then with `loc`. 

With `iloc` we use the command,

In [None]:
dfcars_reindex.iloc[0:3]

since `iloc` uses the position in the index. Notice that the argument `0:3` with `iloc` returns the first three rows of the dataframe, which have label names 5, 6, and 7. 

To access the same rows with `loc`, we write,

In [None]:
dfcars_reindex.loc[5:7] # or dfcars_reindex.loc[0:7]

since `loc` indexes via the label name.  

Here's another example where we return three rows of `dfcars_reindex` that correspond to column attributes `mpg`, `cyl`, and `disp`.  First do it with `iloc`:

In [None]:
dfcars_reindex.iloc[2:5, 1:4]

Notice that the rows we're accessing, 2, 3, and 4, have label names 7, 8, and 9, and the columns we're accessing, 1, 2, and 3, have label names `mpg`, `cyl`, and `disp`.  So for both rows and columns, we're accessing elements of the dataframe using the integer position indices.  Now let's do it with `loc`:

In [None]:
dfcars_reindex.loc[7:9, ['mpg', 'cyl', 'disp']]

We don't have to remember that `disp` is the third column of the dataframe --- we can simply access it with `loc` using the label name `disp`. 

Generally we prefer `iloc` for indexing rows and `loc` for indexing columns. 
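
One detail worth pointing out from the examples above: slices with `iloc` exclude the end position (like list slicing), while slices with `loc` include the end label. Here is a minimal sketch on a made-up dataframe `toy` whose row labels start at 5.

```python
import pandas as pd

toy = pd.DataFrame(
    {"mpg": [21.0, 22.8, 18.7, 14.3], "cyl": [6, 4, 8, 8]},   # hypothetical data
    index=[5, 6, 7, 8],
)

by_position = toy.iloc[0:2]   # positions 0 and 1 (end excluded)
by_label = toy.loc[5:6]       # labels 5 and 6 (end included)

print(list(by_position.index))
print(list(by_label.index))
print(toy.loc[7, "mpg"])      # a single value, by row label and column label
```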

<div class="exercise"><b>Exercise (to do at home)</b></div>
In this exercise you'll examine the documentation to generate a toy dataframe from scratch.  Go to the documentation and click on "10 minutes to pandas" in the table of contents.  Then do the following:

>1.  Create a series called `column_1` with entries 0, 1, 2, 3.

>2.  Create a second series called `column_2` with entries 4, 5, 6, 7.

>3.  Glue these series into a dataframe called `table`, where the first and second labelled column of the dataframe are `column_1` and `column_2`, respectively.  In the dataframe, `column_1` should be indexed as `col_1` and `column_2` should be indexed as `col_2`.

>4. Oops!  You've changed your mind about the index labels for the columns.  Use `rename` to rename `col_1` as `Col_1` and `col_2` as `Col_2`.  

> *Stretch*: Can you figure out how to rename the row indexes?  Try to rename `0` as `zero`, `1` as `one`, and so on.

In [None]:
# your code here



### Reading `json` into `pandas` dataframe

Before moving on, there is one more convenient thing to discuss.

Hopefully you remember reading and writing data to `json` files from earlier in the lab.

Now that you're equipped with `pandas`, we can discuss reading `json` data into `pandas` dataframes!

Recall that we saved a `json` file earlier called `dog_shelter_info.txt`.  Let's load it up, convert it to a `pandas` dataframe, and take a look.

In [None]:
from io import StringIO # Lets read_json treat a string as a file-like object

# Load dog shelter data
with open('dog_shelter_info.txt', 'r') as f:
    dog_data = json.load(f)

dog_data_json_str = json.dumps(dog_data) # Convert data to json string
print(dog_data_json_str)
df = pd.read_json(StringIO(dog_data_json_str)) # Convert to pandas dataframe
df.head() # Look at data
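
The contents of `dog_shelter_info.txt` are not shown here, so the sketch below uses a made-up list of records to illustrate the same round trip. When the data is already a list of dictionaries in memory, passing it straight to `pd.DataFrame` is often simpler than serializing it to a string first.

```python
import json
from io import StringIO

import pandas as pd

# Hypothetical records standing in for the shelter data
records = [
    {"name": "Rex", "age": 3},
    {"name": "Luna", "age": 5},
]

# Route 1: build the frame straight from the Python objects
df_direct = pd.DataFrame(records)

# Route 2: go through a JSON string, wrapped in StringIO so that
# read_json treats it as a file-like object
json_str = json.dumps(records)
df_from_json = pd.read_json(StringIO(json_str))

print(df_direct)
print(df_from_json)
```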

#### Recap

At this point, you have:
* Refreshed your `python` knowledge
* Learned about basic `python` I/O
* Learned basic text processing with `python` string methods
* Learned advanced text processing with regular expressions
* Learned a little bit about the `json` format
* Become proficient with `pandas`

The last part of this lab will focus on working with even uglier data formats.  Specifically, we will look at parsing data from a web page.  Fortunately, there is a wonderful library out there that makes life much easier in this regard.  It is called *BeautifulSoup*.

##  Part 11: Beautiful Soup 
Data Engineering, the process of gathering and preparing data for analysis, is a very big part of Data Science.

Datasets might not be formatted in the way you need (e.g. you have categorical features but your algorithm requires numerical features); or you might need to cross-reference some dataset to another that has a different format; or you might be dealing with a dataset that contains missing or invalid data.

These are just a few examples of why data retrieval and cleaning are so important.

---

### `requests`:  Retrieving Data from the Web

`Python` has several libraries that were developed over the years to retrieve data from the Internet (e.g. `urllib` in the standard library, its Python 2 sibling `urllib2`, and the third-party `urllib3`).

However, these libraries are very low-level and somewhat hard to use. They become especially cumbersome when you need to issue POST requests or authenticate against a web service.

Luckily, as with most tasks in `Python`, someone has developed a library that simplifies these tasks. In reality, the requests made in this lab are fairly simple, and could easily be done using one of the built-in libraries. However, it is better to get acquainted with `requests` as soon as possible, since you will probably need it in the future.

In [None]:
# You tell Python that you want to use a library with the import statement.
import requests

Now that the `requests` library has been imported into our namespace, we can use the functions it offers.

In this case we'll use the appropriately named `get` function to issue a *GET* request. This is equivalent to typing a URL into your browser and hitting enter.

In [None]:
# Get the HU Wikipedia page
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")

Python is an object-oriented language, and everything in it is an object. Even built-in functions such as `len` are just syntactic sugar for calling methods on objects (`len(x)` calls `x.__len__()`).

We will not dwell too long on OO concepts, but some of Python's idiosyncrasies will be easier to understand if we spend a few minutes on this subject.

When you evaluate an object by itself in a notebook cell, such as the `req` object we created above, Python automatically calls the `__repr__()` method of that object (`print` uses `__str__()` instead). The default implementations of these methods are usually very simple and boring. The `req` object however has a custom implementation that shows the object type (i.e. `Response`) and the HTTP status number (200 means the request was successful).

In [None]:
req

Just to confirm, we will call the `type` function on the object to make sure it agrees with the value above.

In [None]:
type(req)

Another very nifty Python function is `dir`. You can use it to list all the properties of an object.

By the way, properties starting with a single or double underscore are usually not meant to be called directly.

In [None]:
dir(req)
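
To make the object-oriented remarks above a little more concrete, here is a minimal sketch with a made-up class (`Shelf` is hypothetical and has nothing to do with `requests`). It defines the special methods that `len` and the notebook display rely on, and shows that `dir` lists them alongside ordinary attributes.

```python
class Shelf:
    """Hypothetical container used only to illustrate special methods."""

    def __init__(self, books):
        self.books = list(books)

    def __len__(self):
        # len(shelf) ends up calling this method
        return len(self.books)

    def __repr__(self):
        # Shown when the object is evaluated on its own in a notebook cell
        return f"<Shelf [{len(self)} books]>"


shelf = Shelf(["Dune", "Emma"])
print(len(shelf))               # same result as shelf.__len__()
print(repr(shelf))
print("books" in dir(shelf))    # dir lists instance attributes too
print("__len__" in dir(shelf))  # ...as well as the double-underscore methods
```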

Right now `req` holds a reference to a *Response* object; but we are interested in the text associated with the web page, not the object itself.

So the next step is to assign the value of the `text` property of this `Response` object to a variable.

In [None]:
page = req.text
page[20000:30000]

Great! Now we have the text of the Harvard University Wikipedia page. But this mess of HTML tags would be a pain to parse manually, which is why we will use another very cool Python library called `BeautifulSoup`.

### `BeautifulSoup`

Parsing data would be a breeze if we could always use well formatted data sources, such as CSV, JSON, or XML; but some formats such as HTML are both very popular and a pain to parse.

One of the problems with HTML is that over the years browsers have evolved to be very forgiving of "malformed" syntax. Your browser is smart enough to detect some common problems, such as unclosed tags, and correct them on the fly.

Unfortunately, we do not have the time or patience to implement all the different corner cases, so we'll let BeautifulSoup do that for us.

You'll notice that the `import` statement below is different from what we used for `requests`. The _from library import thing_ pattern is useful when you don't want to reference a function by its full name (like we did with `requests.get`), but you also don't want to import every single thing in that library into your namespace.

In [None]:
from bs4 import BeautifulSoup

`BeautifulSoup` can deal with `HTML` or `XML` data, so the next line parses the contents of the `page` variable using its `HTML` parser, and assigns the result of that to the `soup` variable.

In [None]:
soup = BeautifulSoup(page, 'html.parser')

In [None]:
type(soup)

Doesn't look much different from the `page` object representation. Let's make sure the two are different types.

In [None]:
type(page)

Looks like they are indeed different.

`BeautifulSoup` objects have a cool little method that allows you to see the `HTML` content in a nice, indented way.

In [None]:
print(soup.prettify()[:1000])

Looks like it's our page!

We can now reference elements of the `HTML` document in different ways. One very convenient way is by using the dot notation, which allows us to access the elements as if they were properties of the object.

In [None]:
soup.title

This is nice for `HTML` elements that only appear once per page, such as the `title` tag. But what about elements that can appear multiple times?

In [None]:
# Be careful with elements that show up multiple times.
soup.p

Uh oh. `soup.p` returned only the first `p` element. The attribute syntax in `BeautifulSoup` is *syntactic sugar* for `find`, so it silently ignores every later match. That's why it is safer to use the explicit methods behind that syntactic sugar. These are:
* `BeautifulSoup.find` for getting single elements, and 
* `BeautifulSoup.find_all` for retrieving multiple elements.

In [None]:
len(soup.find_all("p"))
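
The same behaviour can be reproduced without downloading anything. The sketch below parses a made-up two-paragraph document (the `html` string and the name `mini_soup` are illustrative) and compares the attribute shorthand with `find` and `find_all`.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document
html = "<html><body><p>first</p><p>second</p></body></html>"
mini_soup = BeautifulSoup(html, "html.parser")

first_via_attribute = mini_soup.p          # shorthand for find("p")
first_via_find = mini_soup.find("p")       # first match only
all_paragraphs = mini_soup.find_all("p")   # every match, in document order

print(first_via_attribute.get_text())
print([p.get_text() for p in all_paragraphs])
```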

If you look at the Wikipedia page in your browser, you'll notice that it has a couple of tables in it. We will be working with the "Demographics" table, but first we need to find it.

One of the `HTML` attributes that will be very useful to us is the `class` attribute.

Getting the class of a single element is easy!

In [None]:
soup.table["class"]

Next we will use a *list comprehension* to see all the tables that have a `class` attribute. 

In [None]:
# the classes of all tables that have a class attribute set on them
[t["class"] for t in soup.find_all("table") if t.get("class")]

As already mentioned, we will be using the Demographics table for this lab. The next cell contains the `HTML` elements of said table. We will render it in different parts of the notebook to make it easier to follow along the parsing steps.

In [None]:
table_demographics = soup.find_all("table", "wikitable")[2]

In [None]:
from IPython.display import HTML # Public import path for the HTML display helper
HTML(str(table_demographics))

First we'll use a list comprehension to extract the row (*tr*) elements.

In [None]:
rows = [row for row in table_demographics.find_all("tr")]
print(rows)

In [None]:
header_row = rows[0]
HTML(str(header_row))

We will then use a `lambda` expression to replace new line characters with spaces. `Lambda` expressions are to functions what list comprehensions are to lists: namely a more concise way to achieve the same thing.

In reality, both lambda expressions and list comprehensions are a little different from their function and loop counterparts. But for the purposes of this class we can ignore those differences.

In [None]:
# Lambda expressions return the value of the expression inside it.
# In this case, it will return a string with new line characters replaced by spaces.
rem_nl = lambda s: s.replace("\n", " ")
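
For comparison, here is the same helper written both ways. The name `rem_nl_def` and the sample string are made up for this illustration.

```python
rem_nl = lambda s: s.replace("\n", " ")

def rem_nl_def(s):
    """Same behaviour as rem_nl, written as a named function."""
    return s.replace("\n", " ")

sample = "Asian\nAmerican"   # made-up header text containing a newline
print(rem_nl(sample))
print(rem_nl(sample) == rem_nl_def(sample))
```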

#### Splitting the data
Next we extract the text value of the columns. If you look at the table above, you'll see that we have three columns and six rows.

Here we're doing the following:
* Taking the first element (`Python` indices start at zero)
* Iterating over the *th* elements inside it
* Taking the text value of those elements

We should end up with a list of column names.

But there is one little caveat: the first header cell of the table is actually empty (look at the cell right above the row names). We could add it to our list and then remove it afterwards; but instead we will use the `if` clause inside the list comprehension to filter that out.

In the following cell, `get_text` will return an empty string for the first cell of the table, which means that the test will fail and the value will not be added to the list.

In [None]:
# the if col.get_text() takes care of no-text in the upper left
columns = [rem_nl(col.get_text()) for col in header_row.find_all("th") if col.get_text()]
columns

Now let's do the same for the rows. Notice that since we have already parsed the header row, we will continue from the second row.

In [None]:
indexes = [row.find("th").get_text() for row in rows[1:]]
indexes

Now we want to transform the strings in the cells to integers.  To do this, we follow a very common `python` pattern:
1. Check if the last character of the string is a percent sign
2. If it is, then convert the characters before the percent sign to an integer
3. If the check fails, return a value of `None`

These steps can be conveniently packaged into a function using `if-else` statements.

In [None]:
def to_num(s):
    if s[-1] == "%":
        return int(s[:-1])
    else:
        return None

Notice that `Python` slices are open on the upper bound. So the `[:-1]` construct will return all elements of the string, except for the last.

Another nice way to write our `to_num` function would be
```python
def to_num(s):
    return int(s[:-1]) if s[-1] == "%" else None
```
Notice that we only had to write `return` one time and everything conveniently fits on one line.  I'll leave it up to you to decide if it's readable or not.
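
One caveat with both versions: `s[-1]` raises an `IndexError` on an empty string, and a trailing newline left over from `get_text` makes the percent check fail. The following is a sketch of a more forgiving variant, under the assumption that such cells should simply map to `None`. The name `to_num_safe` is hypothetical.

```python
def to_num_safe(s):
    """Hypothetical variant of to_num that tolerates padding and empty cells."""
    s = s.strip()                 # drop surrounding whitespace and newlines
    if s.endswith("%"):           # False for an empty string, so no IndexError
        try:
            return int(s[:-1])
        except ValueError:        # e.g. "3.5%" or "n/a%"
            return None
    return None

print(to_num_safe("12%\n"))
print(to_num_safe(""))
print(to_num_safe("3.5%"))
```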

Now we use the `to_num` function in a list comprehension to parse the table values.

Notice that we have two `for ... in ...` in this list comprehension. That is perfectly valid and somewhat common.

Although there is no real limit to how many iterations you can perform at once, having more than two can be visually unpleasant, at which point either regular nested loops or saving intermediate comprehensions might be a better solution.
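
The reading order of a double `for` can be checked on a small made-up list of lists (`toy_rows` is illustrative data, not the Wikipedia table): the first `for` in the comprehension is the outer loop.

```python
# Made-up stand-in for the table: each inner list plays the role of one row's cells
toy_rows = [["1%", "2%"], ["3%", "4%"]]

# Comprehension with two for clauses: the outer loop comes first
flat = [cell for row in toy_rows for cell in row]

# Equivalent nested loops
flat_loop = []
for row in toy_rows:
    for cell in row:
        flat_loop.append(cell)

print(flat)
print(flat == flat_loop)
```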

In [None]:
values = [to_num(value.get_text()) for row in rows[1:] for value in row.find_all("td")]
values

The problem with the list above is that the values lost their grouping.

The `zip` function is used to combine two sequences element-wise. In Python 3 it returns a lazy iterator, so we wrap it in `list` to see the result. So 
```python
list(zip([1,2,3], [4,5,6]))
```
would return
```python
[(1, 4), (2, 5), (3, 6)]
```

Next we create three lists corresponding to the three columns by taking every third value, starting at offsets 0, 1, and 2.

In [None]:
stacked_values_lists = [values[i::len(columns)] for i in range(len(columns))] # step by the number of columns
stacked_values_lists

We then use `zip`. 

In [None]:
stacked_values = zip(*stacked_values_lists)
list(stacked_values)

Notice the use of the `*` in front: it unpacks the list of lists into separate positional arguments, so `zip` receives one argument per column and regroups the values by row.
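
A small made-up example shows what the `*` does (`columns_of_values` is illustrative data): the unpacked call and the call with explicitly listed arguments give the same result.

```python
columns_of_values = [[1, 2, 3], [4, 5, 6]]   # made-up data: two columns, three rows

# zip(*columns_of_values) is the same call as zip([1, 2, 3], [4, 5, 6])
unpacked = list(zip(*columns_of_values))
explicit = list(zip(columns_of_values[0], columns_of_values[1]))

print(unpacked)
print(unpacked == explicit)
```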

In [None]:
# Here's the original HTML table for visual understanding
HTML(str(table_demographics))

## Part 12: Plotting
Conveying your findings convincingly is an absolutely crucial part of any analysis. Therefore, you must be able to write well and make compelling visuals.  Creating informative visuals is an involved process and we won't cover that in this lab.  However, part of creating informative data visualizations means generating *readable* figures.  If people can't read your figures or have a difficult time interpreting them, they won't understand the results of your work.  Here are some non-negotiable commandments for any plot:
* Label $x$ and $y$ axes
* Axes labels should be informative
* Axes labels should be large enough to read
* Make tick labels large enough
* Include a legend if necessary
* Include a title if necessary
* Use appropriate line widths
* Use different line styles for different lines on the plot
* Use different markers for different lines

There are other important elements, but that list should get you started on your way.

Here is the anatomy of a figure:
 <img src="https://tacaswell.github.io/matplotlib/_images/anatomy.png" alt="Drawing" style="width: 500px;"/>
 
taken from [showcase example code: anatomy.py](https://tacaswell.github.io/matplotlib/examples/showcase/anatomy.html).

We will work with `matplotlib` and `seaborn` for plotting in this class.  Today's lab will focus on `matplotlib`, which is a very powerful `python` library for making scientific plots.  `seaborn` is a little more specialized in that it was developed for statistical data visualization.  We will put in a little bit of `seaborn` at the end, but if you are comfortable with `matplotlib` then `seaborn` will be a breeze.

Before diving in, one more note should be made.  We will not focus on the internal aspects of `matplotlib`.  Today's lab will really only focus on the basics and developing good plotting practices.  There are many excellent tutorials out there for `matplotlib`.  For example,
* [`matplotlib` homepage](https://matplotlib.org/)
* [`matplotlib` tutorial](https://github.com/matplotlib/AnatomyOfMatplotlib)

Okay, let's get started!

### `matplotlib`

First, let's generate some data.

<div class="exercise"><b>Exercise</b></div>
Later on in this lab, we will use the following three functions to make some plots:

* Logistic function:
  \begin{align*}
    f\left(z\right) = \dfrac{1}{1 + be^{-az}}
  \end{align*}
  where $a$ and $b$ are parameters.
* Hyperbolic tangent:
  \begin{align*}
    g\left(z\right) = b\tanh\left(az\right) + c
  \end{align*}
  where $a$, $b$, and $c$ are parameters.
* Rectified Linear Unit:
  \begin{align*}
    h\left(z\right) = 
    \left\{
      \begin{array}{lr}
        z, \quad z > 0 \\
        \epsilon z, \quad z\leq 0
      \end{array}
    \right.
  \end{align*}
  where $\epsilon > 0$ is a small, positive parameter.

You are given the code for the first two functions.  Notice that $z$ is passed in as a `numpy` array and that the function values are returned as `numpy` arrays.  Parameters are passed in as floats.

You should write a function to compute the rectified linear unit.  The input should be a `numpy` array for $z$ and a positive float for $\epsilon$.

In [None]:
# Your code here
import numpy as np

def logistic(z: np.ndarray, a: float, b: float) -> np.ndarray:
    """ Compute logistic function
      Inputs:
         a: exponential parameter
         b: exponential prefactor
         z: numpy array; domain
      Outputs:
         f: numpy array of floats, logistic function
    """
    
    den = 1.0 + b * np.exp(-a * z)
    return 1.0 / den

def stretch_tanh(z: np.ndarray, a: float, b: float, c: float) -> np.ndarray:
    """ Compute stretched hyperbolic tangent
      Inputs:
         a: horizontal stretch parameter (a>1 implies a horizontal squish)
         b: vertical stretch parameter
         c: vertical shift parameter
         z: numpy array; domain
      Outputs:
         g: numpy array of floats, stretched tanh
    """
    return b * np.tanh(a * z) + c

def relu(z: np.ndarray, eps: float = 0.01) -> np.ndarray:
    """ Compute rectificed linear unit
      Inputs:
         eps: small positive parameter
         z: numpy array; domain
      Outputs:
         h: numpy array; relu
    """
    return np.fmax(z, eps * z)
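
The `np.fmax(z, eps * z)` formulation matches the piecewise definition only when `eps` lies between 0 and 1. As a cross-check, here is a sketch that writes the piecewise definition out directly with `np.where` and compares the two. The name `relu_where` is illustrative.

```python
import numpy as np

def relu_where(z, eps=0.01):
    """Leaky ReLU written directly from the piecewise definition (illustrative name)."""
    return np.where(z > 0, z, eps * z)

z = np.linspace(-2.0, 2.0, 9)
reference = np.fmax(z, 0.01 * z)   # the maximum-based formulation, with eps = 0.01

print(np.allclose(relu_where(z), reference))
```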

Now let's make some plots.  First, let's just warm up and plot the logistic function.

In [None]:
x = np.linspace(-5.0, 5.0, 100) # Equally spaced grid of 100 pts between -5 and 5

f = logistic(x, -1.0, 1.0) # Generate data

In [None]:
import matplotlib.pyplot as plt

# This is only needed in Jupyter notebooks!  Displays the plots for us.
%matplotlib inline 

plt.plot(x, f); # Use the semicolon to suppress some iPython output (not needed in real Python scripts)

Wonderful!  We have a plot.  It's terribly ugly and uninformative.  Let's clean it up a bit by putting some labels on it.

In [None]:
plt.plot(x, f)
plt.xlabel('x')
plt.ylabel('f')
plt.title('Logistic Function');

Okay, it's getting better.  Still super ugly.  I see these kinds of plots at conferences all the time.  Unreadable.  We can do better.  Much, much better.  First, let's throw on a grid.

In [None]:
plt.plot(x, f)
plt.xlabel('x')
plt.ylabel('f')
plt.title('Logistic Function')
plt.grid(True)

At this point, our plot is starting to get a little better but also a little crowded.

#### A note on gridlines
Gridlines can be very helpful in many scientific disciplines.  They help the reader quickly pick out important points and limiting values.  On the other hand, they can really clutter the plot.  Some people recommend never using gridlines, while others insist on them being present.  The correct approach is probably somewhere in between.  Use gridlines when necessary, but dispense with them when they take away more than they provide.  Ask yourself if they help bring out some important conclusion from the plot.  If not, then best just keep them away.

Before proceeding any further, I'm going to change notation.  The plotting interface we've been working with so far is okay, but not as flexible as it can be.  In fact, I don't usually generate my plots with this interface.  I work with slightly lower-level methods, which I will introduce to you now.  The reason I need to make a big deal about this is because the lower-level methods have a slightly different API.  This will become apparent in my next example.

In [None]:
fig, ax = plt.subplots(1,1) # Get figure and axes objects

ax.plot(x, f) # Make a plot

# Create some labels
ax.set_xlabel('x')
ax.set_ylabel('f')
ax.set_title('Logistic Function')

# Grid
ax.grid(True)

Wow, it's *exactly* the same plot!  Notice, however, the use of `ax.set_xlabel()` instead of `plt.xlabel()`.  The difference is tiny, but you should be aware of it.  I will use this plotting syntax from now on.

What else do we need to do to make this figure better?  Here are some options:
* Make labels bigger!
* Make line fatter
* Make tick mark labels bigger
* Make the grid less pronounced
* Make figure bigger

Let's get to it.

In [None]:
fig, ax = plt.subplots(1,1, figsize=(10,6)) # Make figure bigger

ax.plot(x, f, lw=4) # Linewidth bigger
ax.set_xlabel('x', fontsize=24) # Fontsize bigger
ax.set_ylabel('f', fontsize=24) # Fontsize bigger
ax.set_title('Logistic Function', fontsize=24) # Fontsize bigger
ax.grid(True, lw=1.5, ls='--', alpha=0.75) # Update grid

Notice:
* `lw` stands for `linewidth`.  We could also write `ax.plot(x, f, linewidth=4)`
* `ls` stands for `linestyle`.
* `alpha` sets the opacity (0 is fully transparent, 1 is fully opaque).

Things are looking good now!  Unfortunately, people still can't read the tick mark labels.  Let's remedy that presently.

In [None]:
fig, ax = plt.subplots(1,1, figsize=(10,6)) # Make figure bigger

# Make line plot
ax.plot(x, f, lw=4)

# Update ticklabel size
ax.tick_params(labelsize=24)

# Make labels
ax.set_xlabel(r'$x$', fontsize=24) # Use TeX for mathematical rendering
ax.set_ylabel(r'$f(x)$', fontsize=24) # Use TeX for mathematical rendering
ax.set_title('Logistic Function', fontsize=24)

ax.grid(True, lw=1.5, ls='--', alpha=0.75)

The only thing remaining to do is to change the $x$ limits.  Clearly these should go from $-5$ to $5$.

In [None]:
fig, ax = plt.subplots(1,1, figsize=(10,6)) # Make figure bigger

# Make line plot
ax.plot(x, f, lw=4)

# Set axes limits
ax.set_xlim(x.min(), x.max())

# Update ticklabel size
ax.tick_params(labelsize=24)

# Make labels
ax.set_xlabel(r'$x$', fontsize=24) # Use TeX for mathematical rendering
ax.set_ylabel(r'$f(x)$', fontsize=24) # Use TeX for mathematical rendering
ax.set_title('Logistic Function', fontsize=24)

ax.grid(True, lw=1.5, ls='--', alpha=0.75)

You can play around with figures forever making them perfect.  At this point, everyone can read and interpret this figure just fine.  Don't spend your life making the perfect figure.  Make it good enough so that you can convey your point to your audience.  Then save it for later.

In [None]:
fig.savefig('logistic.png')

Done!  Let's take a look.
![](logistic.png)
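
The styling calls repeated in the cells above can be collected into a small helper so that each new figure needs only one extra line. This is a sketch under stated assumptions rather than part of the lab: `style_axes`, `fig_demo`, `ax_demo`, and the file name `logistic_demo.png` are all made up, and the data is recomputed inline so the block runs on its own.

```python
import matplotlib.pyplot as plt
import numpy as np

def style_axes(ax, xlabel, ylabel, title, fontsize=24):
    """Hypothetical helper that applies the label, tick, and grid settings used above."""
    ax.set_xlabel(xlabel, fontsize=fontsize)
    ax.set_ylabel(ylabel, fontsize=fontsize)
    ax.set_title(title, fontsize=fontsize)
    ax.tick_params(labelsize=fontsize)
    ax.grid(True, lw=1.5, ls='--', alpha=0.75)

xs = np.linspace(-5.0, 5.0, 100)
ys = 1.0 / (1.0 + np.exp(xs))   # logistic function with a = -1, b = 1

fig_demo, ax_demo = plt.subplots(1, 1, figsize=(10, 6))
ax_demo.plot(xs, ys, lw=4)
ax_demo.set_xlim(xs.min(), xs.max())
style_axes(ax_demo, r'$x$', r'$f(x)$', 'Logistic Function')

# dpi and bbox_inches are optional savefig arguments; the file name is made up
fig_demo.savefig('logistic_demo.png', dpi=150, bbox_inches='tight')
```

Passing `bbox_inches='tight'` trims the surrounding whitespace, which helps when large font sizes push labels toward the edge of the figure.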

#### Resources
If you want to see all the styles available, please take a look at the documentation.
* [Line styles](https://matplotlib.org/2.0.1/api/lines_api.html#matplotlib.lines.Line2D.set_linestyle)
* [Marker styles](https://matplotlib.org/2.0.1/api/markers_api.html#module-matplotlib.markers)
* [Everything you could ever want](https://matplotlib.org/2.0.1/api/lines_api.html#matplotlib.lines.Line2D.set_marker)

We haven't discussed it yet, but you can also put a legend on a figure.  You'll do that in the next exercise.  Here are some additional resources:
* [Legend](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html)
* [Grid](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.grid.html)

<div class="exercise"><b>Exercise</b></div>

Do the following:
* Make a figure with the logistic function, hyperbolic tangent, and rectified linear unit.
* Use different line styles for each plot
* Put a legend on your figure

Here's an example of a figure:
![](nice_plots.png)

You don't need to make the exact same figure, but it should be just as nice and readable.

In [None]:
# your code here

# First get the data
f = logistic(x, -2.0, 1.0)
g = stretch_tanh(x, 2.0, 0.5, 0.5)
h = relu(x)

fig, ax = plt.subplots(1,1, figsize=(10,6)) # Create figure object

# Make actual plots
# (Notice the label argument!)
ax.plot(x, f, lw=4, ls='-', label=r'$L(x;-2)$')
ax.plot(x, g, lw=4, ls='--', label=r'$0.5\tanh(2x) + 0.5$')
ax.plot(x, h, lw=4, ls='-.', label=r'$\mathrm{relu}(x; 0.01)$')

# Make the tick labels readable
ax.tick_params(labelsize=24)

# Set axes limits to make the scale nice
ax.set_xlim(x.min(), x.max())
ax.set_ylim(h.min(), 1.1)

# Make readable labels
ax.set_xlabel(r'$x$', fontsize=24)
ax.set_ylabel(r'$h(x)$', fontsize=24)
ax.set_title('Activation Functions', fontsize=24)

# Set up grid
ax.grid(True, lw=1.75, ls='--', alpha=0.75)

# Put legend on figure
ax.legend(loc='best', fontsize=24);

fig.savefig('nice_plots.png')

There are many more things you can do to the figure to spice it up.  Remember, there is a tradeoff between making a figure look good and the time you put into it.  

**The guiding principle should be that your audience needs to easily read and understand your figure.**

There are of course other types of figures including, but not limited to, 
* Scatter plots (you will use these all the time)
* Bar charts
* Histograms
* Contour plots
* Surface plots
* Heatmaps

There is documentation on each one of these.  You should feel comfortable enough with the plotting API now to dig in and make readable, understandable plots.

Before moving on, we will discuss another way to make your plots look good without all the hassle.  I'll make a beautiful plot without having to specify annoying arguments every single time.

In [None]:
import config # User-defined config file
plt.rcParams.update(config.pars) # Update rcParams to make nice plots

# First get the data
f1 = logistic(x, -1.0, 1.0)
f2 = logistic(x, -2.0, 1.0)
f3 = logistic(x, -3.0, 1.0)

fig, ax = plt.subplots(1,1, figsize=(10,6)) # Create figure object

# Make actual plots
# (Notice the label argument!)
ax.plot(x, f1, ls='-', label=r'$L(x;-1)$')
ax.plot(x, f2, ls='--', label=r'$L(x;-2)$')
ax.plot(x, f3, ls='-.', label=r'$L(x;-3)$')

# Set axes limits to make the scale nice
ax.set_xlim(x.min(), x.max())
ax.set_ylim(0.0, 1.1) # Logistic values lie between 0 and 1

# Make readable labels
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$L(x)$')
ax.set_title('Logistic Functions')

# Set up grid
ax.grid(True, lw=1.75, ls='--', alpha=0.75)

# Put legend on figure
ax.legend(loc='best')

That's a good-looking plot!  Notice that we didn't need to have all those annoying `fontsize` specifications floating around.  If you want to reset the defaults, just use `plt.rcdefaults()`.

Now, how in the world did this work?  Obviously, there is something special about the `config` file.  I didn't give you a config file, but the next exercise requires you to create one.

<div class="exercise"><b>Exercise</b></div>
* Read the *matplotlib rcParams* section at the following page: [Customizing matplotlib](https://matplotlib.org/users/customizing.html)
* Create your very own `config.py` file.  It should have the following structure:
```python
pars = {}
```
You must fill in the `pars` dictionary yourself.  All the possible parameters can be found at the link above.  For example, if you want to set a default line width of `4`, then you would have 
```python
pars = {'lines.linewidth': 4}
```
  in your `config.py` file.
* Make sure your `config.py` file is in the same directory as your lab notebook.
* Make a plot (similar to the one I made above) using your `config` file.
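
As one possible starting point, here is a sketch of what the `pars` dictionary might contain. Every key is a standard `rcParams` name, but the values are only a matter of taste and are not the ones used for the plot above. The sketch also uses `plt.rc_context`, which applies the settings only inside a `with` block instead of changing the global defaults.

```python
import matplotlib.pyplot as plt

# One possible config.py; the values are illustrative choices
pars = {
    'lines.linewidth': 4,
    'axes.labelsize': 24,
    'axes.titlesize': 24,
    'xtick.labelsize': 24,
    'ytick.labelsize': 24,
    'legend.fontsize': 24,
    'figure.figsize': (10, 6),
}

# rc_context applies the settings only inside the with block
with plt.rc_context(pars):
    inside = plt.rcParams['lines.linewidth']
after = plt.rcParams['lines.linewidth']

print(inside, after)
```

Using `plt.rcParams.update(pars)` instead, as in the cell above, keeps the settings in force for the rest of the session.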

### `seaborn`
Early on in this plotting section, I mentioned `seaborn`.  You can use `seaborn` to make very nice plots of statistical data.  Here is the main website:  [seaborn: statistical data visualization](https://seaborn.pydata.org/).

We won't dive deep into `seaborn` here.  It is quite popular in the data science community, but it is ultimately up to you whether or not you choose to use it.

`seaborn` works great with `pandas`.  It can also be customized easily.  Here is the basic `seaborn` tutorial: [Seaborn tutorial](https://seaborn.pydata.org/tutorial.html).

### No Excuses
With all of these resources, there is no reason to have a bad figure. EVER.

## Part 13:  References

Congratulations!  You've completed the lab.  You're likely very comfortable with some parts of the lab, and less comfortable with other parts.  Below are a few tutorials and practice problems if you'd like to do more -- check them out.
<ol >
<li> Tutorial: https://realpython.com/files/python_cheat_sheet_v1.pdf </li>
<li> Short practice problems with solutions: http://www.practicepython.org</li>
<li> Jupyter notebooks (0, 1, and 2 are more relevant): https://github.com/jrjohansson/scientific-python-lectures </li>
</ol>
 