Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "Orion"
COLLABORATORS2 = ""

---

# Intro to Data Science

## Timing and Profiling

Based on Dr. VanderPlas' book. 
- Professor at the University of Washington and Visiting Researcher at Google

> *The Python Data Science Handbook* by Jake VanderPlas (O’Reilly). Copyright 2016 Jake VanderPlas, 978-1-491-91205-8.

Code: <br>
https://jakevdp.github.io/PythonDataScienceHandbook/


“A data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician.”
- Josh Blumenstock


![Data Science Venn Diagram](Data_Science_VD.png)

<small>(Source: [Drew Conway](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram))</small>

While some of the intersection labels are a bit tongue-in-cheek, this diagram captures the essence of what I think people mean when they say "data science": it is fundamentally an **interdisciplinary** subject.

Data science comprises three distinct and overlapping areas:
- the skills of a **statistician** who knows how to model and summarize datasets (which are growing ever larger);
- the skills of a **computer scientist** who can design and use algorithms to efficiently store, process, and visualize this data;
- and the **domain expertise**, what we might think of as "classical" training in a subject, necessary both to formulate the right questions and to put their answers in context.

With this in mind, I would encourage you to think of data science not as a new domain of knowledge to learn, but a new set of **skills that you can apply** within your current area of expertise.

Whether you are reporting election results, forecasting stock returns, optimizing online ad clicks, identifying microorganisms in microscope photos, seeking new classes of astronomical objects, or working with data in any other field, the goal of this book is to give you the ability to **ask and answer new questions** about your chosen subject area.

## Timing Code Snippets: ``%timeit`` and ``%time``

We saw the ``%timeit`` line magic and the ``%%timeit`` cell magic in the introduction to magic functions; they can be used to time the repeated execution of snippets of code:

In [None]:
%time %timeit sum(range(100))

#Exercise: Can we time how long timeit took?


Note that because this operation is so fast, ``%timeit`` automatically does a large number of repetitions.
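Outside IPython, a similar automatic choice of loop count is available in the standard-library `timeit` module. The sketch below is only an approximation of what ``%timeit`` does (IPython's own heuristic may differ in detail). The names `fast` and `slow` and the 200 x 200 loop size are illustrative assumptions.

```python
import timeit

# Fast statement: the same one timed with %timeit above.
fast = timeit.Timer("sum(range(100))")

# Slower statement: a small nested loop (200 x 200 iterations, an arbitrary
# size chosen only to keep this sketch quick to run).
slow = timeit.Timer(
    "total = 0\n"
    "for i in range(200):\n"
    "    for j in range(200):\n"
    "        total += i * (-1) ** j"
)

# autorange() raises the loop count (1, 2, 5, 10, ...) until one batch takes
# at least 0.2 seconds, then returns (loops, seconds_for_that_batch).
fast_loops, fast_seconds = fast.autorange()
slow_loops, slow_seconds = slow.autorange()

print(f"fast: {fast_loops} loops, {fast_seconds / fast_loops * 1e6:.2f} µs per loop")
print(f"slow: {slow_loops} loops, {slow_seconds / slow_loops * 1e3:.2f} ms per loop")
```

The fast statement should end up with far more loops than the slow one, because both batches are sized to take roughly the same wall-clock time.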


For slower commands, ``%timeit`` will automatically adjust and perform fewer repetitions:

In [None]:
%%timeit
total = 0
for i in range(1000):
    for j in range(1000):
        total += i * (-1) ** j


Sometimes repeating an operation is not the best option.

For example, if we have a list that we'd like to sort, we might be misled by a repeated operation.

Sorting a pre-sorted list is much faster than sorting an unsorted list, so the repetition will skew the result:

In [2]:
import random
L = [random.random() for i in range(100000)]
print("L has length",len(L)," L[0:2]",L[0:2])
%timeit L.sort()


L has length 100000  L[0:2] [0.26277767893917625, 0.8611852161878162]
2.18 ms ± 45.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
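To see why the repeated measurement is misleading, each sort can be timed exactly once. This is a minimal sketch using `time.perf_counter` from the standard library rather than an IPython magic. The helper `time_sort` and the fixed seed are assumptions made for illustration.

```python
import random
import time

random.seed(0)  # fixed seed, only so the sketch is repeatable
data = [random.random() for _ in range(100000)]


def time_sort(values):
    """Sort `values` in place once and return the elapsed seconds."""
    start = time.perf_counter()
    values.sort()
    return time.perf_counter() - start


first_sort = time_sort(data)   # data is still unsorted here
second_sort = time_sort(data)  # data is already sorted by now

print(f"first (unsorted) sort:   {first_sort * 1e3:.2f} ms")
print(f"second (presorted) sort: {second_sort * 1e3:.2f} ms")
```

Only the first call measures the cost of sorting random data. Every later call, including nearly all of the loops that ``%timeit`` runs, sees a list that is already in order.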


For this, the ``%time`` magic function may be a better choice. 

It also is a good choice for longer-running commands, when short, system-related delays are unlikely to affect the result.

Let's time the sorting of an unsorted and a presorted list:

In [None]:
import random
#see https://docs.python.org/3/library/random.html#random.random

L = [random.random() for i in range(100000)]
print("L has",len(L), "elements. L[0:2]=",L[0:2])
print("sorting an unsorted list:")
%time L.sort()


In [None]:
print("sorting an already sorted list:")
%time L.sort()


Notice how much faster the presorted list is to sort, but notice also how much longer the timing takes with ``%time`` versus ``%timeit``, even for the presorted list!

This is because ``%timeit`` does some clever things under the hood to prevent system calls from interfering with the timing.

For example, it prevents cleanup of unused Python objects (known as **garbage collection**) which might otherwise affect the timing.

For this reason, ``%timeit`` usually reports noticeably shorter times than ``%time``.
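The garbage-collection behaviour can be explored with the standard-library `timeit` module, which turns collection off while timing unless it is re-enabled in the setup string. The sketch below assumes an allocation-heavy statement (`stmt`, chosen for illustration). How large the difference is, and whether it is visible at all, depends on the machine and Python version.

```python
import timeit

# A statement that allocates many small container objects.
stmt = "[[] for _ in range(10000)]"

# Default behaviour: timeit disables garbage collection while timing.
gc_off = min(timeit.repeat(stmt, number=100, repeat=5))

# Re-enable collection for the timed run through the setup string.
gc_on = min(
    timeit.repeat(stmt, setup="import gc; gc.enable()", number=100, repeat=5)
)

print(f"GC disabled (timeit default): {gc_off * 1e3:.2f} ms per 100 loops")
print(f"GC enabled via setup:         {gc_on * 1e3:.2f} ms per 100 loops")
```

Taking the minimum over several repeats is a common way to reduce the influence of unrelated system activity on the reported time.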

For ``%time``, as with ``%timeit``, using the double-percent-sign cell magic syntax allows timing of multiline scripts:

In [None]:
%%time
total = 0
for i in range(1000):
    for j in range(1000):
        total += i * (-1) ** j
        

For more information on ``%time`` and ``%timeit``, as well as their available options, use the IPython help functionality (i.e., type ``%time?`` at the IPython prompt).

## Profiling Full Scripts: ``%prun``

A program is made of many single statements, and sometimes timing these statements in context is more important than timing them on their own.

Python contains a built-in code profiler (which you can read about in the Python documentation), but IPython offers a much more convenient way to use this profiler, in the form of the magic function ``%prun``.

By way of example, we'll define a simple function that does some calculations:

In [None]:
def sum_of_lists(N):
    total = 0
    for i in range(5):
        # Reminder: ^ is bitwise XOR and >> is a right bit shift.
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total

Now we can call ``%prun`` with a function call to see the profiled results:

In [None]:
%prun sum_of_lists(1000000)


In the notebook, the output is printed to the pager, and looks something like this:

```
14 function calls in 0.714 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5    0.599    0.120    0.599    0.120 <ipython-input-19>:4(<listcomp>)
        5    0.064    0.013    0.064    0.013 {built-in method sum}
        1    0.036    0.036    0.699    0.699 <ipython-input-19>:1(sum_of_lists)
        1    0.014    0.014    0.714    0.714 <string>:1(<module>)
        1    0.000    0.000    0.714    0.714 {built-in method exec}
```

The result is a table that indicates, in order of total time on each function call, where the execution is spending the most time. In this case, the bulk of execution time is in the list comprehension inside ``sum_of_lists``.

From here, we could start thinking about what changes we might make to improve the performance of the algorithm.

For more information on ``%prun``, as well as its available options, use the IPython help functionality (i.e., type ``%prun?`` at the IPython prompt).
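Outside IPython, a comparable table can be produced with the built-in profiler directly. The following is a sketch using `cProfile` and `pstats` from the standard library. It repeats the `sum_of_lists` definition so that it runs on its own, and it uses a smaller `N` (an arbitrary choice) to keep the run short. The exact rows and timings will differ by machine and Python version.

```python
import cProfile
import io
import pstats


def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total


# Profile a single call to the function.
profiler = cProfile.Profile()
result = profiler.runcall(sum_of_lists, 100000)

# Collect the report in a string, ordered by internal time ("tottime"),
# which is the same ordering shown in the %prun output.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(5)
report = buffer.getvalue()
print(report)
```

Here `tottime` is the time spent inside a function itself, while `cumtime` also includes the time spent in the functions it calls.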

## More! 

Other profiling options:

You can download extensions for more specific profilers.

- Line-by-line profiling: ``%lprun`` (from the `line_profiler` extension)
- Profiling memory: ``%memit`` and ``%mprun`` (from the `memory_profiler` extension)
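
When installing an extension is not an option, a rough look at memory use is possible with the standard-library `tracemalloc` module. This is a stand-in technique, not what ``%memit`` does internally, and its figures cover only allocations traced by Python, so they are not directly comparable. The function is repeated here so the sketch runs on its own, with an arbitrary `N`.

```python
import tracemalloc


def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total


# Trace Python-level allocations made while the function runs.
tracemalloc.start()
result = sum_of_lists(100000)
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"still allocated after the call: {current_bytes / 1e6:.2f} MB")
print(f"peak during the call:           {peak_bytes / 1e6:.2f} MB")
```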