# Lab 19

Today, we will dig a bit more into evaluating the efficiency of our code via _benchmarking._ Today's goals are:

0. Define different kinds of benchmarking 
1. Deploy different kinds of benchmarking

## Machine Learning at Scale

Part of working with large data and complex code is finding out how long each piece of code takes to run. Today, we will return to our implementation of gradient descent and use that as our example. To prepare for that, please import the data from Lab 9 and fill in the functions below with your implementations from Lab 11:

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import numpy as np
from numpy import linalg as LA
import pandas as pd

import timeit ## <-- New line!
import time

# import line_profiler
# import memory_profiler

from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier

In [None]:
## Functions for later use 
    
def compute_mse(truth_vec, predict_vec):
    return np.mean((truth_vec - predict_vec)**2)
    
def compute_m_partial(in_vals, truth_vec, predict_vec):
    return -2*np.mean(in_vals*(truth_vec-predict_vec))

def compute_b_partial(truth_vec, predict_vec):
    return -2*np.mean(truth_vec-predict_vec)

def grad_des(input_data, truth_vec, max_steps):
    # Add your implementation for gradient descent here
    pass

def minibatch_gd(input_data, truth_vec, batch_size, max_steps):
    # Add your implementation for mini-batch gradient descent here
    pass

def stochastic_gd(input_data, truth_vec, max_steps):
    # Add your implementation for stochastic gradient descent here
    pass

In [None]:
## Import Data

employ_data = pd.read_csv("../Lab09-Parameters/lab9data.csv", sep = ",")

## numpy vectors of our inputs
neuro = employ_data[["neuroticism"]].to_numpy()
perform = employ_data[["performance"]].to_numpy()

In [None]:
## Import Weather Data

weather_data = np.genfromtxt("../Lab16-DecisionTrees/lab16data.csv", delimiter=',', skip_header=1)
weather_pd = pd.read_csv("../Lab16-DecisionTrees/lab16data.csv", sep = ",")

# Split into the input variables and the target classes
in_weather = weather_data[:,:4]
out_class = weather_data[:,4]

# Get the variable names 
var_names = list(weather_pd.columns)[:4]

## Timing our implementations with `timeit`

Last time, we used the `time` module to time how long an implementation takes to run. Another option is to use the **magic** commands built into IPython (the kernel behind our notebooks) to time individual lines or cells. These commands take the form `%command` for a single line and `%%command` for a whole cell. Today, we will use a few of them, starting with `%%timeit`.

Because run times vary a little from run to run, `%%timeit` runs the code several times and reports the mean run time (with its standard deviation), whereas `%%time` runs the code once. Notice how the outputs of `%%timeit` and `%%time` differ:

In [None]:
%%timeit

# Specify and fit the model
grove = RandomForestClassifier(n_estimators=10, max_features = 3, max_depth=2, random_state=0)
grove.fit(in_weather, out_class)


In [None]:
%%time

# Specify and fit the model
grove = RandomForestClassifier(n_estimators=10, max_features = 3, max_depth=2, random_state=0)
grove.fit(in_weather, out_class)
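
Outside a notebook, the same idea is available through the standard-library `timeit` module that we imported above. The sketch below is only an illustration: it uses a hypothetical stand-in workload (`sum_of_squares`) instead of the random forest fit, and contrasts one wall-clock measurement (in the spirit of `%%time`) with repeated measurements (in the spirit of `%%timeit`).

```python
import time
import timeit

def sum_of_squares(n=10_000):
    # Hypothetical stand-in workload; swap in your own function call.
    return sum(i * i for i in range(n))

# One measurement, similar in spirit to %%time
start = time.perf_counter()
sum_of_squares()
single_run = time.perf_counter() - start

# Repeated measurements, similar in spirit to %%timeit:
# 5 repeats, each timing 100 back-to-back calls
totals = timeit.repeat(sum_of_squares, repeat=5, number=100)
per_call = [t / 100 for t in totals]

print(f"single run:        {single_run:.6f} s")
print(f"best of 5 repeats: {min(per_call):.6f} s per call")
print(f"mean of 5 repeats: {sum(per_call) / len(per_call):.6f} s per call")
```

The single measurement is usually noisier than the repeated ones, which is why repeated timing is the safer choice for short pieces of code.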

## Benchmarking

Last time (and above), we looked at how long a whole algorithm takes to do its work. This is a good first step, but when we _benchmark_ code, we examine how fast each piece of the code is. This means that we need a system that tells us how each piece runs. We could do this by adding a timing line around each individual piece of our code, or we could use a _profiler_, which gives us more nuanced information about the run times without our having to insert additional timing lines.

Benchmarking as a whole allows us to determine whether we need to change any lines because of _bottlenecks_ (places where the code slows down).
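
To make the "timing line for each piece" option concrete, here is a minimal sketch that times the stages of a single gradient descent step by hand. It is not the Lab 11 implementation: the data are synthetic, the step is written in pure Python for self-containment, and the names `timings` and `rate` are assumptions made for this illustration.

```python
import random
import time

random.seed(0)
x = [random.random() for _ in range(50_000)]
y = [2.0 * xi + 1.0 for xi in x]  # synthetic data: slope 2, intercept 1
m, b, rate = 0.0, 0.0, 0.1        # hypothetical starting values and learning rate

timings = {}

# Stage 1: predictions for the current m and b
start = time.perf_counter()
pred = [m * xi + b for xi in x]
timings["predict"] = time.perf_counter() - start

# Stage 2: partial derivatives of the mean squared error
start = time.perf_counter()
m_partial = -2 * sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, pred)) / len(x)
b_partial = -2 * sum(yi - pi for yi, pi in zip(y, pred)) / len(x)
timings["partials"] = time.perf_counter() - start

# Stage 3: parameter update
start = time.perf_counter()
m, b = m - rate * m_partial, b - rate * b_partial
timings["update"] = time.perf_counter() - start

# The largest entry is the candidate bottleneck
for stage, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:>8}: {seconds:.6f} s")
```

This works, but every stage needs its own pair of timing lines, which is the bookkeeping that a profiler removes.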

Today, we will look at three kinds of _profilers._ Specifically, we are looking at:
0. Profiling a whole script
1. Line-by-line profiling
2. Memory profiling

For this part, we need to install two packages with conda:
* `line_profiler`
* `memory_profiler`

The remainder of this lab follows this [blog post](https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html).

### (First) Profiler

The first _magic_ command that we will use is `%prun`, the profiler command. It is akin to `cProfile` or `profile`, the usual [profilers for python scripting](https://docs.python.org/3/library/profile.html). It reports every function call that your code touches, including calls deep inside base Python.

This profiler will give us [various timing information](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-prun). Run the line below and consult the linked documentation to see what the profiler is telling you.

In [None]:
%prun grove.fit(in_weather, out_class)
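
For comparison, the same kind of report can be produced in a plain Python script with the standard-library `cProfile` and `pstats` modules. The sketch below is an illustration only; `slow_part`, `fast_part`, and `workload` are hypothetical stand-ins for your own functions.

```python
import cProfile
import io
import pstats

def slow_part(n=200_000):
    # Hypothetical expensive piece
    return sum(i * i for i in range(n))

def fast_part(n=1_000):
    # Hypothetical cheap piece
    return sum(range(n))

def workload():
    slow_part()
    fast_part()

# Collect profiling data while workload() runs
profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time and keep the ten most expensive entries
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

The report has one row per function with the columns `ncalls`, `tottime`, `percall`, `cumtime`, and `filename:lineno(function)`, the same layout that `%prun` shows.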

#### Your turn

Run `%prun` on your implementation of `grad_des` and on `minibatch_gd`. What is surprising about these two functions? 

### Line Profiler

Instead of looking at a full script, we might want to look at how much time each line that we coded takes to execute.

To start this process, we must load `line_profiler` as a "magic" extension:

In [None]:
%load_ext line_profiler

Once it is loaded as "magic", we can use `%lprun`, which runs our functions and times them line by line. Let's do a silly example. Before running the line profiler, which line do you think will take longer?

In [None]:
def tune_fit(in_vars,classes):
    grove = RandomForestClassifier(n_estimators=10, max_features = 3, max_depth=2, random_state=0)
    grove.fit(in_vars,classes)

In [None]:
%lprun -f tune_fit tune_fit(in_weather, out_class)

#### Your turn

Run `%lprun` on your implementation of `grad_des` and on `minibatch_gd`. What is surprising about these two functions? 

### Memory Profiler

Instead of looking at the timing of each line, we might want to look at how much memory each line uses as it executes, as well as the total memory for the function.

To start this process, we must load `memory_profiler` as a "magic" extension:

In [None]:
%load_ext memory_profiler

Now that it is loaded as "magic", we can use two commands: `%memit`, which gives us the total memory usage, and `%mprun`, which gives us a line-by-line assessment of memory.

Let's again turn to our silly example. Before running the memory profiler, which line do you think will use more memory?

In [None]:
%memit tune_fit(in_weather, out_class)
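
`%memit` depends on the third-party `memory_profiler` package. As a rough cross-check that needs only the standard library, `tracemalloc` can report how much memory Python allocated while a piece of code ran. This is a different tool from the one used in this lab, so its numbers should not be expected to match `%memit`, which looks at the memory of the whole process. The workload `build_table` below is a hypothetical stand-in.

```python
import tracemalloc

def build_table(n=100_000):
    # Hypothetical stand-in workload: a list of n floats
    return [float(i) for i in range(n)]

tracemalloc.start()
table = build_table()
# current: memory still held by traced allocations; peak: the maximum reached
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 2**20:.2f} MiB")
print(f"peak:    {peak / 2**20:.2f} MiB")
```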

#### Your turn

Run `%memit` on your implementation of `grad_des` and on `minibatch_gd`. What is surprising about these two functions? 

#### Bonus - `mprun` 

To run `%mprun`, we first have to save our example to a file, because `%mprun` only works on functions defined in a separate module rather than in the notebook itself. Again following the earlier blog post:

In [1]:
%%file mprun_demo.py

from sklearn.ensemble import RandomForestClassifier
import numpy as np
from numpy import linalg as LA
import pandas as pd

def tune_fit(in_vars,classes):
    grove = RandomForestClassifier(n_estimators=10, max_features = 3, max_depth=2, random_state=0)
    grove.fit(in_vars,classes)

Overwriting mprun_demo.py


In [2]:
# %mprun is only available after the memory_profiler extension is loaded
%load_ext memory_profiler
from mprun_demo import tune_fit
%mprun -f tune_fit tune_fit(in_weather, out_class)


### Final Thoughts

To finish up this lab, answer the question: **what did you learn by benchmarking the three versions of gradient descent?** Share your answer in a post in the **#lab_submission** channel on Slack. Your post must start with **Lab19** to get credit.

If you have questions from this lab, post them to #lab_questions with the same preamble (i.e. starting with **Lab19**). If you have the same question as someone else, please use one of the emojis to upvote the question. If you would like to answer someone's question, please use the thread function. This will tie your answer to their question.

### Next Time

We will start our journey into deep learning. 

#### Resources consulted 

0. [Benchmarking your code](https://rbspy.github.io/benchmarking-your-code/)
1. [IPython Magic Commands](https://jakevdp.github.io/PythonDataScienceHandbook/01.03-magic-commands.html)
2. [Profiling and Timing Code](https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html)
3. [How do I get time of a Python program's execution?](https://stackoverflow.com/questions/1557571/how-do-i-get-time-of-a-python-programs-execution)