# CPSC 330 Lecture 21

Outline:

- Announcements (5 min)
- Activity: explaining `GridSearchCV` (15 min)
- Principles of good explanations (15 min)
- Break (5 min)
- ML and decision-making (15 min)
- Decision-making activity (15 min)

Reminder to self: **turn on recording!**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import log_loss

In [2]:
plt.rcParams['font.size'] = 16

## Announcements (5 min)

- hw7 deadline passed.
- hw8 will be the last assignment. 
  - Tentative deadline: Thurs April 9 at 6pm.
- This Friday, April 3, is the last day of tutorials.
- There are 3 more lectures including today; the last one is Tuesday, April 7.
- More info coming soon on the final exam.
  - Tentative plan: similar to an assignment, open-book, but time-limited and no collaboration allowed.
  - Rationale: this format would do a reasonable job of testing the main points of the course, and it would be harder to cheat on than a closed-book exam.
- The exam is still April 24 from 12-2:30pm.
- We have added 10 office hours during April 8-23; see the [calendar](https://www.cs.ubc.ca/~mgelbart/calendar.html).
  - These will take place on Collaborate Ultra.

## Attribution

The content of this lecture is adapted from [DSCI 542](https://github.com/UBC-MDS/DSCI_542_comm-arg), created by [David Laing](https://davidklaing.com/).

## Activity: explaining `GridSearchCV` (15 min)

Below are two possible explanations of `GridSearchCV`. Let's assume the audience is someone with a CS background but no ML experience.

Read them both and then follow the instructions at the end.

#### Explanation 1

Machine learning algorithms, like an airplane's cockpit, typically involve a bunch of knobs and switches that need to be set.

![](https://i.pinimg.com/236x/ea/43/f3/ea43f3c7f3a8c92d884ce012c77628fd--cockpit-gauges.jpg)

For example, check out the documentation of the popular random forest algorithm [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). Here's a list of the function arguments, along with their default values (from the documentation):

> class sklearn.ensemble.RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

Holy cow, that's a lot of knobs and switches! As a machine learning practitioner, how am I supposed to choose `n_estimators`? Should I leave it at the default of 100? Or try 1000? What about `criterion` or `class_weight` for that matter? Should I trust the defaults?

Enter [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to save the day. The general strategy here is to choose the settings that perform best on the specific task of interest. So I can't say `n_estimators=100` is better than `n_estimators=1000` without knowing what problem I'm working on. For a specific problem, you usually have a numerical score that measures performance. `GridSearchCV` is part of the popular [scikit-learn](https://scikit-learn.org/) Python machine learning library. It works by searching over various settings and tells you which one worked best on your problem.

The "grid" in "grid search" comes from the fact that it tries all possible combinations on a grid. For example, if you want it to consider setting `n_estimators` to 100, 150 or 200, and you want it to consider setting `criterion` to `'gini'` or `'entropy'`, then it will search over all 6 possible combinations: `n_estimators=100, criterion='gini'`, `n_estimators=100, criterion='entropy'`, `n_estimators=150, criterion='gini'`, `n_estimators=150, criterion='entropy'`, `n_estimators=200, criterion='gini'`, and `n_estimators=200, criterion='entropy'`. So the "grid" in this case is a grid of 3 possible values by 2 possible values, for 6 points on the grid in total.
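To make the 3-by-2 grid concrete, here is a minimal sketch that enumerates the combinations using `itertools.product` from Python's standard library. It only illustrates the idea of a grid; it is not meant to show how `GridSearchCV` is implemented internally, and the variable names are just for illustration.

```python
from itertools import product

# the candidate values for each hyperparameter (same values as in the text)
param_grid = {
    'n_estimators': [100, 150, 200],
    'criterion': ['gini', 'entropy'],
}

# every combination of one value per hyperparameter is a point on the grid
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

for combo in combinations:
    print(combo)

print(len(combinations), "combinations in total")
```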

Here is a code sample that uses `GridSearchCV` to select from the 6 options we just mentioned. The problem being solved is classifying images of handwritten digits into the 10 digit categories (0-9). I chose this because the dataset is conveniently built in to scikit-learn:

In [4]:
# imports
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets

# load a dataset
data = datasets.load_digits()
X = data['data']
y = data['target']

# set up the grid search
grid_search = GridSearchCV(RandomForestClassifier(),
                           param_grid={
                                'n_estimators': [100, 150, 200],
                                'criterion': ['gini', 'entropy']
                           })

# run the grid search
grid_search.fit(X, y)
grid_search.best_params_

{'criterion': 'gini', 'n_estimators': 100}

As we can see from the output above, the grid search selected `n_estimators=100, criterion='gini'`, which was one of our 6 options above.

By the way, these "knobs" we've been setting are called [_hyperparameters_](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) and the process of setting these hyperparameters automatically is called [_hyperparameter optimization_](https://en.wikipedia.org/wiki/Hyperparameter_optimization) or _hyperparameter tuning_.

~400 words, not including code.

<br><br><br><br><br><br>

#### Explanation 2

https://medium.com/datadriveninvestor/an-introduction-to-grid-search-ff57adcc0998

~400 words, not including code.

<br><br><br><br><br><br>

#### Discussion questions:

- What do you like about each explanation?
- What do you dislike about each explanation?
- Which explanation do you think is more effective overall? Why?
- Each explanation has an image. Which one is more effective? What are the pros/cons?
- Each explanation has some sample code. Which one is more effective? What are the pros/cons?
- Are the two explanations aimed at similar audiences?
- Which explanation has likely helped more people?

After you're done reading, take ~5 min to consider the discussion questions above. Paste your answer to **at least one** of the above questions in [this Google doc](https://docs.google.com/document/d/1PsYKhHuF4aGYTCn2DLq3Klp2o6T0ZH-8zA6nq4PbYPI/edit?usp=sharing) under the appropriate question heading.

## Principles of good explanations (15 min)

#### Concepts *then* labels, not the other way around.

- The first explanation starts with an analogy for the concept (and the label is left until the very end):

> Machine learning algorithms, like an airplane's cockpit, typically involve a bunch of knobs and switches that need to be set.

- In the second explanation, the first sentence is wasted on anyone who doesn't already know what "hyperparameter tuning" means:

> Grid search is the process of performing hyper parameter tuning in order to determine the optimal values for a given model. 


See [this video](https://twitter.com/ProfFeynman/status/899963856549625858?s=20): "I learned very early the difference between knowing the name of something and knowing something." -Richard Feynman.

#### Bottom-up explanations.

- The [Curse of Knowledge](https://en.wikipedia.org/wiki/Curse_of_knowledge) leads to *top-down* explanations:

![](img/top_down.png)

- When you know something well, you think about things in the context of all your knowledge. 
- Those lacking the context, or frame of mind, cannot easily understand. 

There is another way: *bottom-up* explanations:

![](img/bottom_up.png)

When you're brand new to a concept, you benefit from analogies, concrete examples and familiar patterns.

- In hindsight, perhaps this is why I was so [disappointed](https://piazza.com/class/k1gx4b3djbv3ph?cid=112) with [Lecture 6](https://github.students.cs.ubc.ca/cpsc330-2019w-t2/home/blob/master/lectures/06_feature-preprocessing.ipynb).
- I think starting each lesson from a dataset is a more authentic and "bottom-up" approach to teaching applied ML.

#### New ideas in small chunks.

The first explanation has a hidden conceptual skeleton:

1. The concept of setting a bunch of values.
2. Random forest example.
3. The problem / pain point.
4. The solution.
5. How it works - high level.
6. How it works - written example.
7. How it works - code example.
8. The name of what we were discussing all this time.

#### Examples from all angles.

When we're trying to draw mental boundaries around a concept, it's helpful to see examples on all sides of those boundaries. If we were writing a longer explanation, it might have been better to show more, e.g.

- Performance with and without hyperparameter tuning. 
- Other types of hyperparameter tuning (e.g. `RandomizedSearchCV`).
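
If we were writing that longer explanation, the two extra angles above might look something like the following sketch. It reuses the digits dataset and the same 6 candidate combinations from the first explanation. The choices of `cv=3`, `n_iter=3` and `random_state=0` are assumptions made here to keep the run quick and repeatable; they do not come from either explanation.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# same digits dataset as in the first explanation
data = datasets.load_digits()
X = data['data']
y = data['target']

# angle 1: cross-validation accuracy with the default hyperparameters
default_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=3)
print("default hyperparameters:", default_scores.mean())

# angle 2: randomized search tries only n_iter of the 6 combinations,
# chosen at random, instead of all of them
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        'n_estimators': [100, 150, 200],
        'criterion': ['gini', 'entropy'],
    },
    n_iter=3,        # assumption: 3 of the 6 combinations, to keep it quick
    cv=3,            # assumption: 3 folds, for speed
    random_state=0,  # assumption: fixed seed so reruns try the same combinations
)
random_search.fit(X, y)
print("randomized search:", random_search.best_params_, random_search.best_score_)
```

Comparing the two printed scores gives the "with and without tuning" angle, although on a dataset this easy the difference may well be small.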

#### Reuse your running examples.

The first explanation uses the same example throughout the text and code. This helps readers follow the line of reasoning.


#### When experimenting, show the results asap.

The first explanation shows the output of the code, whereas the second does not. This is easy to do and makes a big difference.

#### Exercise restraint: interesting to you != useful to the reader.

BTW, here is something I deleted from my explanation:

> Some hyperparameters, like `n_estimators` are numeric. Numeric hyperparameters are like the knobs in the cockpit: you can tune them continuously. `n_estimators` is numeric. Categorical hyperparameters are like the switches in the cockpit: they can take on (two or more) distinct values. `criterion` is categorical. 

It's a very elegant analogy! But is it helpful?

And furthermore, what is my hidden motivation for wanting to include it? Elegance, art, and the pursuit of higher beauty? Or _making myself look smart_? So maybe another name for this principle could be **It's not about you.**

## So, why should I care about effective communication?

- Most ML practitioners work in an organization with more than one person.
- There will very likely be stakeholders other than yourself.
- Those people need to understand what you're doing because:
  - their state of mind may change the way you do things (see below)
  - your state of mind may change the way they do things (interpreting your results)
- In my experience, ML suffers from some particular communication issues:
  - overstating one's results or being unable to articulate their limitations
  - being unable to explain the predictions
  - and the reason is: these things are actually very hard to explain!
    - Why did CatBoost make that prediction?
    - Can we trust test error?
    - What does it mean if `predict_proba` outputs 0.9?
    - Etc.

## Break (5 min)

## ML and decision-making (15 min)

- There is often a wide gap between what people care about and what ML can do.
- To understand what ML can do, let's think about what **decisions** will be made using ML. 

#### Decisions are just intentional manipulations of variables

They take two atomic forms:

- **How much?** (numeric variable)
- **Which one?** (categorical variable)

| How much? (numeric) | Which one? (categorical) |
| ------------------- | ------------------------ |
| ![](img/how_much.png) | ![](img/which_one.png) |

Question: what principle of good explanations did I just violate?

<br><br><br><br><br><br>

Answer: I started top-down. Here's a version with examples:

- There is often a wide gap between what people care about and what ML can do.
  - e.g. "Create an algorithm that outputs future house prices"
  - Can ML do this?

Decisions take two atomic forms:

1. How much? (numeric variable)
   - e.g. How much should I list my house for?
2. Which one? (categorical variable)
   - e.g. Should I sell my house?


#### Every decision is a swirl of interconnected variables

- The **decision variable**: the variable that is manipulated through the decision.
  - E.g. how much should I sell my house for?
- The decision-maker's **objectives**: the variables that the decision-maker ultimately cares about, and wishes to manipulate indirectly through the decision variable.
  - E.g. my total profit.
- The **context**: the variables that mediate the relationship between the decision variable and the objectives.
  - E.g. the housing market, the cost of marketing the house.

#### The decision variable, and its values (the decision-maker's "alternatives")

- How much should I list my house for?
  - decision variable: number of dollars
  - alternatives: \\$400k, \\$450k, \\$500k.
- Should I sell my house?
  - decision variable: whether to sell.
  - alternatives: yes, no

#### The decision-maker's objectives: what does the decision aim to achieve?

- How much pasta should I cook for dinner? Objectives: 
  - Reduction in hunger (numeric)
  - Minimization of wasted food (numeric)
- Should I bring my raincoat? Objectives:
  - Minimization of probability of getting wet (numeric)
  - Minimization of probability of carrying around needless weight (numeric)
  
Question: what principle of good explanations did I just violate?

<br><br><br><br><br><br>

Answer: changing the running example!

#### The context: what does the decision depend on?

- How much should I sell my house for? Context:
  - How much money do I need?
  - How much money do I think I can get?
  - How much time am I willing to wait for it to sell?
- Should I sell my house? Context:
  - What is the probability of the price going up?
  - What is the probability of the price going down?
  - How much do I need my house anyway?

#### The process that every ML project either augments or automates

1. Define objectives
2. Understand context
3. Evaluate alternatives based on objectives and context
4. Select an alternative

Most commonly, the step that ML augments or automates is (3) or (4).
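
As a rough illustration of steps (3) and (4), here is a sketch of the house-listing decision from earlier. Every number in it is made up: the chances of selling stand in for what an ML model might predict, `mortgage_owed` is a hypothetical piece of context, and the objective is assumed to be expected profit.

```python
# alternatives: the values the decision variable (listing price) can take
alternatives = [400_000, 450_000, 500_000]

# context (hypothetical numbers): what I still owe, and the chance the house
# sells at each price; in practice an ML model might estimate these chances
mortgage_owed = 350_000
prob_sale = {400_000: 0.9, 450_000: 0.6, 500_000: 0.2}

def expected_profit(price):
    """Objective: profit if the house sells, weighted by the chance it sells."""
    return prob_sale[price] * (price - mortgage_owed)

# step 3: evaluate each alternative based on the objective and the context
evaluations = {price: expected_profit(price) for price in alternatives}
for price, value in evaluations.items():
    print(f"list at ${price:,}: expected profit ${value:,.0f}")

# step 4: select the alternative with the best value of the objective
best_price = max(evaluations, key=evaluations.get)
print(f"selected listing price: ${best_price:,}")
```

In this sketch ML would only supply the `prob_sale` estimates; defining the objective and understanding the context (steps 1 and 2) are still up to the decision-maker.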

#### How does this inform you as an ML practitioner?

Questions you have to answer:

- Who is the decision maker?
- What are their objectives?
- What are their alternatives?
- What is their context?
- What data do I need?

## Decision-making activity (15 min)

Consider the avocado price dataset from hw7. Let's say you work for Whole Foods, and they are wondering whether they should order more avocados this week or wait until next week.

Answer the following questions:

1. What are your decision variable(s) here?
2. Is the decision numeric or categorical? What are the alternatives? 
3. What are the objective(s)?

and then

4. What data do you need here?
5. What output might you show them from the model you trained in hw7?
6. How does the output connect to the decisions?
7. How would you present your results? What would you advise?

Take 5-10 min for this activity, and then we'll discuss afterwards. Paste your answer to **at least one** of the above questions in [this Google doc](https://docs.google.com/document/d/1PsYKhHuF4aGYTCn2DLq3Klp2o6T0ZH-8zA6nq4PbYPI/edit?usp=sharing) under the appropriate question heading.

## Summary

- Principles of effective communication
  - Concepts then labels, not the other way around.
  - Bottom-up explanations.
  - New ideas in small chunks.
  - Examples from all angles.
  - Reuse your running examples.
  - When experimenting, show the results asap.
  - It's not about you.
- Decision-making
  - Decision variables, objectives, and context.
  - How does ML fit in?
  
Next class we'll talk about communicating probabilities in your predictions, and we'll also talk about principles of effective visualizations. 