# Lab 18

Today, we will spend a bit of time on Decision Trees and Ensemble methods before transitioning to examining the efficacy of an algorithm. We will:

0. Define standard metrics for efficacy
1. Work on timing implementations of algorithms

## Machine Learning at Scale

So far in the course, we have not paid particular attention to the efficacy of our algorithms. Instead, we have coded for a heuristic understanding of the underlying principles of machine learning. When we worked with Gradient Descent, we did consider the raw number of computations needed for each round of gradient descent. Today, we turn our attention to the standard measures for the efficacy of an implementation.

### Epoch

Often we see an implementation's training error compared against the number of _epochs_. An epoch is one complete pass of the algorithm over the whole set of training data.

We have seen examples where an algorithm has to work repeatedly over the whole training dataset. Let's list at least 3 such examples: 

1. 
2. 
3. 

### Batch Size

Training error can be compared against other quantities as well. A common one is the _batch size_: the number of training examples that the algorithm "sees" in a single iteration.

What are two examples of where we see batch size as part of the design?

1. 
2. 

### Iterations

When an algorithm relies on batches of the training data, we often consider the number of _iterations_ of the algorithm needed to complete one epoch (or to "see" all the training data). This number can be computed directly: divide the total number of data points by the batch size and round up.
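As a minimal sketch of that computation, assuming the last batch of an epoch is allowed to be smaller than the others: the helper names `iterations_per_epoch` and `batch_slices` are hypothetical and do not come from any library.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every training example once."""
    if n_samples < 0 or batch_size <= 0:
        raise ValueError("n_samples must be >= 0 and batch_size must be > 0")
    # Round up so that a final, smaller batch still counts as an iteration
    return math.ceil(n_samples / batch_size)


def batch_slices(n_samples, batch_size):
    """Yield (start, stop) index pairs that together cover one epoch."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


# Hypothetical dataset size, chosen only for illustration
n_samples = 1000
for batch_size in (10, 32, 250):
    n_iter = iterations_per_epoch(n_samples, batch_size)
    print("batch size %d: %d iterations per epoch" % (batch_size, n_iter))
```

Each `(start, stop)` pair from `batch_slices` can be used to slice an array of training data, for example `data[start:stop]`.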

### Epochs, Batch Sizes, and Iterations, oh my!

Taking a closer look at a few algorithms via these metrics:

<table>
<thead>
<tr>
<th></th>
<th>Epochs</th>
<th>Batch Size</th>
<th>Iterations</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>k-means</strong></td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td><strong>kNN</strong></td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td><strong>Gradient Descent</strong></td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td><strong>Mini-Batch Gradient Descent</strong></td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td><strong>Stochastic Gradient Descent</strong></td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td><strong>Decision Trees</strong></td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
</tbody>
</table>

## Time? 

While we can talk about the number of passes of the data through an algorithm, it would be more helpful if we could _time_ our algorithms. Today, we will do just that!

In this example, we will use the random forest from last time. 

## Imports for today

In [None]:
## Import block
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

import time

In [None]:
# For function testing 



In [None]:
## Import Data

weather_data = np.genfromtxt("../Lab16-DecisionTrees/lab16data.csv", delimiter=',', skip_header=1)
weather_pd = pd.read_csv("../Lab16-DecisionTrees/lab16data.csv", sep = ",")


In [None]:
# Split into the input variables and the target classes
in_weather = weather_data[:,:4]
out_class = weather_data[:,4]

# Get the variable names 
var_names = list(weather_pd.columns)[:4]

## Timing our implementations with `time`

Using the `time` module, we can time how long it takes to get from the start to the end of training a random forest on our weather data:

In [None]:
# Start the timer: 
start_time = time.time()

# Specify and fit the model
grove = RandomForestClassifier(n_estimators=10, max_features = 3, max_depth=2, random_state=0)
grove.fit(in_weather, out_class)

# Stop the clock and determine the length of time
stop_time = time.time()

print("This took %s seconds to run" % (stop_time-start_time))
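A single measurement can be noisy, since other processes on the machine affect the result. One possible refinement, sketched here under the assumption that we want several measurements of the same call, uses `time.perf_counter()` from the same `time` module. The helper name `time_call` and the toy workload are hypothetical; the workload simply stands in for a model's `fit` call.

```python
import time


def time_call(func, *args, repeats=5, **kwargs):
    """Call func(*args, **kwargs) `repeats` times; return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # high-resolution clock intended for timing
        func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return durations


def toy_workload(n):
    """Hypothetical stand-in for training: sum the first n squares."""
    return sum(i * i for i in range(n))


durations = time_call(toy_workload, 100_000, repeats=5)
print("Fastest of %d runs: %.6f seconds" % (len(durations), min(durations)))
print("Slowest of %d runs: %.6f seconds" % (len(durations), max(durations)))
```

With the objects defined in this lab, the same helper could be called as `time_call(grove.fit, in_weather, out_class)`. Reporting the fastest run alongside the slowest gives a rough sense of how much the timings vary.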

### Making comparisons

In this next section, we will compare your own implementations of k-means and kNN to the off-the-shelf ones. You will need:
1. Your k-means implementation from HW2
2. Your kNN implementation from Lab 7
3. The associated data for both labs

In [None]:
## Run a time analysis for your k-means

In [None]:
## Run a time analysis for sklearn's k-means

#### Comparing implementations

Which was faster? By how much? 

What do you think would happen if you had 100 million data points? 

In [None]:
## Run a time analysis for your kNN

In [None]:
## Run a time analysis for sklearn's kNN

#### Comparing implementations

Which was faster? By how much? 

What do you think would happen if you had 100 million data points? 

### Final Thoughts

To finish up this lab, answer the question: **Why are both epochs and run time important analyses for machine learning?** Share your answer in a post on the **#lab_submission** channel on Slack. Your post must start with **Lab18** to get credit.

If you have questions from this lab, post them to #lab_questions with the same preamble (i.e. starting with **Lab18**). If you have the same question as someone else, please use one of the emojis to upvote their question. If you would like to answer someone's question, please use the thread function. This will tie your answer to their question.

### Next Time

We will look at benchmarking and bottlenecks. 

#### Resources consulted 

0. [Epoch vs Batch Size vs Iterations](https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9)
1. [Epoch, Iterations & Batch Size](https://medium.com/@ewuramaminka/epoch-iterations-batch-size-11fbbd4f0771)
2. [How do you calculate program run time in python?](https://stackoverflow.com/questions/5622976/how-do-you-calculate-program-run-time-in-python)
3. [How do I get time of a Python program's execution?](https://stackoverflow.com/questions/1557571/how-do-i-get-time-of-a-python-programs-execution)