In [None]:
# Setup.
import otter
grader = otter.Notebook()

import pandas as pd
df = pd.DataFrame().assign(x=[1, 2, 3], y=['A', 'A', 'B'])

x = 6

<img src="images/logo.png" width=200>

_**JupyterCon 2023 • May 12th, 2023**_
## Otter-Grader: A Lightweight Solution for Creating and Grading Jupyter Notebook Assignments

<br>

**Suraj Rampure, UC San Diego, rampure@ucsd.edu**

**Christopher Pyles, UC Berkeley and Google, cpyles@berkeley.edu**

<br>

<center><h3>Follow along at <a href="https://tinyurl.com/otter-paris">tinyurl.com/otter-paris</a>!</h3></center>

#### Suraj Rampure

TODO

#### Christopher Pyles

TODO

### Autograding

**What is it?** An automatic method of grading students' code which doesn't require manual reading.

**How does it work?** An autograder re-executes a student's solution code and runs provided tests on the resulting environment.
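As a rough illustration of that idea (a toy sketch, not Otter's actual implementation), the snippet below executes a submission in a fresh namespace and then evaluates test expressions against it. The names `run_tests`, `submission`, and `double` are hypothetical.

```python
def run_tests(solution_source, test_expressions):
    """Execute solution code, then evaluate each test in the resulting environment."""
    env = {}
    exec(solution_source, env)  # re-execute the student's code in a fresh namespace
    results = {}
    for expr in test_expressions:
        try:
            results[expr] = bool(eval(expr, env))
        except Exception:
            # A test that raises counts as a failure.
            results[expr] = False
    return results


# Hypothetical submission and tests, for illustration only.
submission = "def double(n):\n    return 2 * n\n"
results = run_tests(
    submission,
    ["callable(double)", "double(4) == 8", "double(0) == 1"],
)
print(results)
# {'callable(double)': True, 'double(4) == 8': True, 'double(0) == 1': False}
```

A real autograder would also isolate this execution from the grading machine, since student code is untrusted.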

**Why?** Scale.

### What is Otter-Grader?

Otter-Grader is a **lightweight**, **modular**, **open-source** autograder designed to grade programming assignments for classes at **any scale**.

- Works with both Python and R.

- Designed for Jupyter Notebooks but compatible with other formats.

- Includes tooling for assignment development and distribution.

- Supports local execution or third-party services like Gradescope.

- **Core abstraction: You provide the compute, Otter does the rest!**

### Setup

To create or work on assignments, all one needs to do is install Otter [via `pip`](https://pypi.org/project/otter-grader/):

<br>

```
pip install otter-grader
```

### Agenda

1. Demonstration of the assignment authoring, release, and collection process.

2. Discussion of various use-cases and extensions.

3. Shortcomings and future plans.

## Demo

### Generating student-facing notebooks

Otter works best in assignments where students are provided exposition and **skeleton code** and need to fill in the blanks themselves.

For instance:

---

Below, complete the implementation of the function `a_cubed_plus_b_squared`, which takes in two numbers `a` and `b` and returns the cube of `a` added to the square of `b`.

In [None]:
def a_cubed_plus_b_squared(a, b):
    a_cubed = ...
    b_squared = ...
    return a_cubed + b_squared

---

When students run the following cell, their implementation will be tested on a series of **public test cases**, which typically check whether their answer is of the right data type or in the right range.

In [None]:
grader.check("q1_3")

### Source notebooks

The corresponding instructor-facing source code may look like this:

---

Below, complete the implementation of the function `a_cubed_plus_b_squared`, which takes in two numbers `a` and `b` and returns the cube of `a` added to the square of `b`.

```
BEGIN QUESTION
name: q1_3
```

In [None]:
def a_cubed_plus_b_squared(a, b):
    a_cubed = a ** 3 # SOLUTION
    b_squared = b ** 2 # SOLUTION
    return a_cubed + b_squared

In [None]:
## TEST ##
callable(a_cubed_plus_b_squared)

In [None]:
## TEST ##
a_cubed_plus_b_squared(5, 2) == 129

In [None]:
## HIDDEN TEST ##
a_cubed_plus_b_squared(-2, -4) == 8

Note that while working on an assignment, students are not told whether their work passes **hidden tests**, which typically check for correctness more thoroughly than public tests.
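To make the public/hidden distinction concrete, here is a minimal sketch (not Otter's internals) that runs the three tests above against the solution and builds the summary a student would see before grades are released. The `TESTS` structure and the `run_all` and `student_view` helpers are hypothetical.

```python
def a_cubed_plus_b_squared(a, b):
    a_cubed = a ** 3
    b_squared = b ** 2
    return a_cubed + b_squared


# (test, hidden?) pairs mirroring the source notebook cells above.
TESTS = [
    (lambda: callable(a_cubed_plus_b_squared), False),
    (lambda: a_cubed_plus_b_squared(5, 2) == 129, False),  # 125 + 4
    (lambda: a_cubed_plus_b_squared(-2, -4) == 8, True),   # -8 + 16
]


def run_all(tests):
    """Run every test; return a list of (passed, hidden) pairs."""
    return [(bool(test()), hidden) for test, hidden in tests]


def student_view(outcomes):
    """Keep only the public results, which is all students see at first."""
    return [passed for passed, hidden in outcomes if not hidden]


outcomes = run_all(TESTS)
print(student_view(outcomes))               # [True, True]
print([passed for passed, _ in outcomes])   # [True, True, True]
```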

### One workflow

1. Instructors create a source notebook.

2. Instructors run `otter assign` to generate artifacts (a student-facing notebook and an autograder zip file).

3. Instructors distribute student-facing notebooks to students.

4. Instructors upload the autograder zip file to the course grading platform.

5. Students submit their completed notebooks to the course grading platform and see their grades instantly.

### Step 1: Creating source notebooks

Source notebooks consist of:

- Skeleton code with solutions, as shown previously.

- Test cases, both public and hidden.

- Question-level metadata (e.g. number of points, whether the question should be set aside for human grading).

- Assignment-level metadata (e.g. files to include, environment requirements).

<br>

<center>(see demo)</center>

### Step 2: Running `otter assign` to generate artifacts

Suppose `src/proj03.ipynb` is a source notebook. After the source notebook is complete, an instructor might run:

<br>

```
otter assign src/proj03.ipynb build/
```

Then, in `build/`, they'd find:

- An `autograder/` directory, which contains an `autograder.zip` file.

- A `student/` directory, which contains the student-facing notebook, along with any specified files and images.

<br>

<center>(see demo)</center>

### Step 3: Distributing student-facing notebooks

Students need access to the `student/` folder produced by `otter assign`. Instructors can provide this directory by:

- Uploading a zip of the directory to a course learning management system (LMS).

- Pushing the directory to a public course GitHub repository and asking students to pull it.

- Pushing the directory to a public course GitHub repository and using `nbgitpuller` to automatically pull the repo and open the relevant assignment on a JupyterHub.

#### Example: `nbgitpuller`

<center><img src="images/nbgitpuller-screenshot.png" width=80%><br>Screenshot taken from <a href="https://dsc10.com">dsc10.com</a>.</center>

The bolded hyperlink points to:

http://datahub.ucsd.edu/user-redirect/git-sync?repo=https://github.com/dsc-courses/dsc10-2023-sp&subPath=homeworks/hw02/hw02.ipynb
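A link of this shape can be assembled programmatically. The sketch below rebuilds the URL above with the standard library. `make_git_sync_link` is a hypothetical helper, and the path and parameter names are copied from that example rather than taken from the `nbgitpuller` documentation.

```python
from urllib.parse import urlencode


def make_git_sync_link(hub_url, repo_url, sub_path):
    """Build a git-sync link shaped like the example above."""
    # Leave ':' and '/' unescaped so the result matches the link shown.
    query = urlencode({"repo": repo_url, "subPath": sub_path}, safe=":/")
    return f"{hub_url.rstrip('/')}/user-redirect/git-sync?{query}"


link = make_git_sync_link(
    "http://datahub.ucsd.edu",
    "https://github.com/dsc-courses/dsc10-2023-sp",
    "homeworks/hw02/hw02.ipynb",
)
print(link)
```

Generating these links from the assignment path avoids hand-editing one URL per homework.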

### Step 4: Configuring the autograder

The `autograder.zip` file that is generated contains a `requirements.txt` file and all of the test cases that need to be run on students' work.

Often, instructors upload the generated `autograder.zip` file to an autograding platform, such as Gradescope.

<center><img src="images/config-autograder.png" width=50%></center>

### Step 5: Collecting student work

- If using Gradescope, students can upload their finished `.ipynb` files directly.

- Upon submission, they will only see the results of their code on the public tests, which they already had access to in their notebook.

- The hidden tests are also run on their code, but the hidden test scores aren't shown until instructors release grades.

- Questions can be manually graded if desired (e.g. questions in which students must create a plot or interpret results).

<center><img src="images/student-view.png" width=75%>What students see when they submit.<br>Note that they're only shown their results on the public tests, <b>not</b> the hidden tests.</center>

<center><img src="images/instructor-view.png" width=80%>What instructors see when students submit, and what students will see once grades are released.</center>

<center><img src="images/manual-grading.png" width=80%>The grading view that instructors see when grading students' notebooks manually.</center>

### Alternatives to steps 4 and 5

- Instead of uploading `autograder.zip` to an autograding platform, instructors can ask students to submit their completed notebooks to any file upload service (any LMS, Google Drive, etc.).

- Then, instructors can download all submissions at once and run the autograder locally, using the `otter grade` command and the Otter Docker image.

- The result is a CSV of assignment scores for each student, which can then be imported into any LMS.

- This doesn't require having access to proprietary systems, such as Gradescope.
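As a sketch of the last step before an LMS import, the snippet below totals per-question scores from a score CSV. The column names (`file`, `q1_3`, `q2`) and the sample rows are hypothetical placeholders, not the documented output format of `otter grade`. Adjust them to match the file you actually get.

```python
import csv
import io
from pathlib import Path

# Hypothetical score export; real column names depend on the tool's output.
SCORES_CSV = """file,q1_3,q2
alice.ipynb,1.0,2.0
bob.ipynb,0.5,2.0
"""


def total_scores(csv_text, id_column="file"):
    """Map each submission to the sum of its per-question scores."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Use the notebook's file name (without extension) as the identifier.
        student = Path(row.pop(id_column)).stem
        totals[student] = sum(float(points) for points in row.values())
    return totals


totals = total_scores(SCORES_CSV)
print(totals)  # {'alice': 3.0, 'bob': 2.5}
```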

## Use-cases and extensions

### Adopters

- UC Berkeley.
    - Used in **very** large classes. Examples:
        - Data 8: 1500 students/semester.
        - Data 100: 1000 students/semester.
    - Most classes use a campus-hosted JupyterHub server.
    - Mostly Python, but some R.
        - [TODO: Example R notebook](TODO).

- UC San Diego.
    - DSC 10: Works similarly to Berkeley's Data 8 (uses campus-hosted JupyterHub server).
    - DSC 80: Uses an **Otter extension where separate .py and .ipynb files are created**.

- Various other universities, community colleges, and even high schools.

### An extension to Otter

- In one of our courses – DSC 80 at UC San Diego – the philosophy taught is that:
    - Notebooks are for experimentation.
    - Library code should exist in `.py` files.

- In this course, each assignment consists of:
    - A notebook, which contains question prompts and data imports.
    - A `.py` file, which contains function stubs.
    - **Students only submit the `.py` file!**

- By default, Otter takes in a source notebook and generates a student-facing notebook. We built an extension that takes in a source notebook and generates separate student-facing notebooks and `.py` files.
    - Not currently public or officially part of Otter, but ask Suraj.

<center><img src="images/dsc80-notebook.png" width=75%>A student-facing notebook in DSC 80, generated by our Otter extension.</center>

<center><img src="images/dsc80-py.png" width=75%>A student-facing <code>.py</code> file in DSC 80, generated by our Otter extension.</center>

### Pre-Otter assignment authoring in DSC 80

- This course existed well before Otter.

- Previously, instructors would have to separately update a student-facing notebook, a student-facing `.py` file, **and** separate test case `.py` files for each question. This made it difficult to change anything about the assignments.

- **With Otter and this extension, the assignment revision process has become significantly more convenient.**

## Shortcomings and future plans

### Writing test cases is an art and a science

**Your evaluation is only as good as your tests!** Issues:

- Randomness.
    - When students' code involves randomness, how do we write test cases that capture the variance in their possible responses?

- Visibility of public tests.
    - What if you'd like to give students public test cases that they can't look at (and overfit to)?
    - One solution: hashing.
    - Related feature: failure messages.

- Non-deterministic forms.
    - What if we want to check whether a student created a `DataFrameGroupBy` object?

In [None]:
df

In [None]:
df.groupby('y')
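One option for the `DataFrameGroupBy` case, sketched here on the assumption that a type check is an acceptable test for the question, is to test the object's type instead of its displayed output. The displayed form includes a memory address, so it differs from one object to the next.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["A", "A", "B"]})

# Two groupby objects built the same way, both kept alive.
first = df.groupby("y")
second = df.groupby("y")

# The default repr embeds a memory address, so comparing displayed
# output between the solution and a submission is unreliable.
print(repr(first) == repr(second))

# A type check does not depend on the address.
print(isinstance(first, pd.core.groupby.DataFrameGroupBy))  # True
```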

- Output matching.
    - Suppose we ask students to assign `x` to the result of dividing 12 by 2.
    - Suppose our source notebook has `x = 6`, while their code has `x = 12 / 2`, which evaluates to `6.0`.

In [None]:
# This passes only if x is 6, not 6.0!

## TEST ##
x

In [None]:
# This passes either way!

## TEST ##
x == 6
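Returning to the hashing idea under "Visibility of public tests": a test can compare a digest of the student's answer against a stored digest, so the expected value never appears in the notebook. The sketch below is one possible approach, not how Otter implements it, and `answer_hash` and `check_x` are hypothetical names. It also shows how output matching resurfaces: without normalizing first, `6` and `6.0` would hash differently.

```python
import hashlib


def answer_hash(value):
    """Hash a canonical string form of a numeric answer."""
    # Converting to float first makes 6 and 6.0 hash identically.
    canonical = repr(float(value))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The instructor computes this once from the solution and ships only the digest.
EXPECTED_DIGEST = answer_hash(6)


def check_x(x):
    """Return True if x matches the hidden expected answer."""
    return answer_hash(x) == EXPECTED_DIGEST


print(check_x(6.0))  # True
print(check_x(6))    # True
print(check_x(7))    # False
```

When the set of plausible answers is small, a determined student can still recover the answer by hashing candidates, so this discourages overfitting to the tests but does not prevent it.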

### Other shortcomings

Otter...

- Encourages guess-and-check behavior.

- Isn't supported in third-party frontends, like Google Colab.
    - Newer versions of Otter require raw cells ("Raw NBConvert" in the Jupyter interface), which Colab doesn't support.

- Does not support question randomization in any way.
    - Limits its usefulness for exams.

### Future plans

- TODO