In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab02.ipynb")

<table style="width: 100%;">
<tr style="background-color: transparent;">
<td width="100px"><img src="https://cs104williams.github.io/assets/cs104-logo.png" width="90px" style="text-align: center"/></td>
<td>
  <p style="margin-bottom: 0px; text-align: left; font-size: 18pt;"><strong>CSCI 104: Data Science and Computing for All</strong><br>
                Williams College<br>
                Fall 2025</p>
</td>
</tr>
</table>


# Lab 2: Arrays and Tables

<hr style="margin: 0px; border: 3px solid #500082;"/>

<h2>Instructions</h2>

- Before you begin, execute the cell at the TOP of the notebook to load the provided tests, and then the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute these cells again.
- Be sure to consult your [Python Reference](https://cs104williams.github.io/assets/python-library-ref.html)!
- Complete this notebook by filling in the cells provided. For problems asking you to write explanations, you **must** provide your answer in the designated space. 
- Please be sure to not re-assign variables throughout the notebook.  For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously.
- This lab has hidden tests. Even if the tests you can run report 100% passed, your final grade may not be 100%. We will run more tests for correctness once everyone turns in the lab.
- To use one or more late days on this lab, please fill out our [late day form](https://forms.gle/4sD16h3hN1xRqQM27) **before** the due date.

<hr/>
<h2>Setup</h2>


In [None]:
# Run this cell to set up the notebook.
# These lines import the numpy, datascience, and cs104 libraries.

import numpy as np
from datascience import *
from cs104 import *
%matplotlib inline

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 1. Table Review and Analyzing a Movie Dataset (10 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Use built-in table functions to answer quantitative questions about a dataset
</font>

Now that you're familiar with table operations, let’s answer an interesting question about a dataset!  Run the cell below to load the `imdb` table. It contains information about the 250 highest-rated movies on IMDb, as you saw on the prelab this week.

In [None]:
# Just run this cell

imdb = Table.read_table('imdb.csv')
imdb

*A quick note on writing table operations:* Often, we want to perform multiple operations - sorting, filtering, or others - in order to turn a table we have into something more useful. You can do these operations one by one, e.g.

In [None]:
# Just run this cell

movies_from_2013 = imdb.where("Year", are.equal_to(2013))
ordered_2013_movies = movies_from_2013.sort("Title", descending=True)
ordered_2013_movies

However, since the value of the expression `original_tbl.where('col1', are.equal_to(12))` is itself a table, you can just call a table method on it:

In [None]:
# Just run this cell

ordered_2013_movies = imdb.where("Year", are.equal_to(2013)).sort("Title", descending=True)
ordered_2013_movies

You should organize your work in the way that makes the most sense to you, using informative names for any intermediate tables you create. 

#### Part 1.1 (5 pts)


Create a table of movies released between 2010 and 2016 (inclusive) with ratings above 8. The table should only contain the columns `Title` and `Rating`, **in that order**.

Assign the table to the name `above_eight`.

*Hint:* Think about the steps you need to take, and try to put them in an order that makes sense. Feel free to create intermediate tables for each step, but please make sure you assign your final table the name `above_eight`!


In [None]:
above_eight = ...
above_eight

In [None]:
grader.check("p1.1")

#### Part 1.2 (5 pts)


What *proportion* of the movies in the dataset were released 1900-1999, and what *proportion* were released in the year 2000 or later?

Use `num_rows` and arithmetic. Assign `proportion_in_20th_century` to the proportion of movies in the dataset that were released 1900-1999, and `proportion_in_21st_century` to the proportion of movies in the dataset that were released in the year 2000 or later.

*Hint:* The *proportion* of movies released in the 1900s is the *number* of movies released in the 1900s, divided by the *total number* of movies.
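As a small illustration of the hint, here is the same arithmetic on made-up counts. The names and numbers below are hypothetical and are not taken from the `imdb` table.

```python
# Hypothetical counts, not taken from the imdb table.
num_items = 12                                 # total number of items
num_matching = 3                               # items that satisfy some condition
num_not_matching = num_items - num_matching    # every other item

# A proportion is a count divided by the total count.
proportion_matching = num_matching / num_items
proportion_not_matching = num_not_matching / num_items

print(proportion_matching, proportion_not_matching)
```

When two groups together cover every row, as they do here, their proportions add up to 1. That can serve as a quick sanity check on your own answer.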


In [None]:
num_movies_in_dataset = ...
num_in_20th_century = ...
num_in_21st_century = ...
proportion_in_20th_century = ...
proportion_in_21st_century = ...
print("Proportion in 20th century:", proportion_in_20th_century)
print("Proportion in 21st century:", proportion_in_21st_century)

In [None]:
grader.check("p1.2")

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 2. Array practice (25 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Practice creating arrays 
- Analyze the data in arrays using math operations, built-in functions such as `len()`, and array indexing
</font>

#### Part 2.1 (5 pts)


Make an array called `vowels` containing the following vowels: 'A', 'E', 'I', 'O', and 'U'.

In [None]:
vowels = ...
vowels

In [None]:
grader.check("p2.1")

These next exercises give you practice accessing individual items of arrays with the [array.item(index)](https://cs104williams.github.io/assets/python-library-ref.html#item) method.  In Python, each item is accessed by its *index*; for example, the first item is the item at index 0. Indices must be **integers**.

***Note:* If you have previous coding experience, you may be familiar with bracket notation. DO NOT use bracket notation when indexing (i.e. `arr[0]`), as this can yield different data type outputs than what we will be expecting. This can cause you to fail an autograder test.**

Be sure to refer to the [Python Library Reference](https://cs104williams.github.io/assets/python-library-ref.html#item) if you feel stuck!
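To see why the note above matters, the sketch below compares the two ways of reading an item. It assumes a plain NumPy array built with `np.array` (rather than `make_array`) and uses made-up values; the array name is hypothetical.

```python
import numpy as np

# A small example array (hypothetical values, not part of the lab data).
primes = np.array([2, 3, 5, 7])

first_with_item = primes.item(0)     # a plain Python int
first_with_brackets = primes[0]      # a NumPy integer, not a plain int

print(first_with_item, type(first_with_item))
print(first_with_brackets, type(first_with_brackets))
```

Both lines show the value 2, but the types differ, which is the kind of mismatch an autograder test can trip over.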

#### Part 2.2 (5 pts)


Set `third_vowel` to the third item of `vowels`.

In [None]:
third_vowel = ...
third_vowel

In [None]:
grader.check("p2.2")

#### Part 2.3 (5 pts)


You'll sometimes want to find the **last** item of an array.  What is the index of the last item in the five-item `vowels` array?

In [None]:
index_of_last_vowel = ...

In [None]:
grader.check("p2.3")

#### Part 2.4 (5 pts)


More often, you don't know the number of items in an array.  We call the number of items in an array its *length*.  The function `len` takes a single argument, an array, and returns the length of that array (an integer).

The cell below creates an array `fives` containing several thousand five-letter English words compiled by [Don Knuth](https://en.wikipedia.org/wiki/Donald_Knuth), one of the earliest and most influential pioneers in computer science, long before Wordle was a thing.

In [None]:
# Just run this cell 
fives = Table.read_table('fives.csv').column('Five Letter Words')
print(np.random.choice(fives, 10))  # show ten random words from the array

Set `first` and `last` to be the first and last words in the array, respectively. Use `len` to compute the index for the last word.
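The relationship between an array's length and its last index can be sketched on a tiny made-up array. The array below is hypothetical and is built with `np.array` for self-containment.

```python
import numpy as np

# Hypothetical example array, unrelated to fives.
colors = np.array(['red', 'green', 'blue', 'violet'])

num_colors = len(colors)        # number of items in the array
last_index = num_colors - 1     # indices run from 0 up to length - 1

print(colors.item(0), colors.item(last_index))
```

Because indexing starts at 0, the last valid index is always one less than the length.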

In [None]:
first = ...
last = ...

# No need to modify the line below 
print('fives has', len(fives), 'words, starting with', first, 'and ending with', last)

In [None]:
grader.check("p2.4")

#### Part 2.5 (5 pts)


The following array represents the radii of various spheres:

In [None]:
radii = make_array(1,4,3,5,2)

Below, assign `volumes` to an array where each item is the volume of the sphere whose radius is the corresponding item of `radii`.

*Hints:*
- Use **array broadcasting**. 
- Use the numpy library's `np.pi` variable for the value of $\pi$.

Recall that the volume V of a sphere with radius r is:

$$
V = \frac{4}{3} \pi r^3
$$

Thus, the volume of a sphere with radius 1 is approximately `4.1888`.

Then, assign `total_volume` to the total volume of all spheres using the [sum](https://cs104williams.github.io/assets/python-library-ref.html#sum) function.
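As a reminder of how array broadcasting behaves, here is a sketch that applies a different formula, the area of a circle ($\pi r^2$), to every item of a made-up array at once. The names and values are hypothetical, and the array is built with `np.array` for self-containment.

```python
import numpy as np

# Hypothetical radii of three circles (not the lab's radii array).
circle_radii = np.array([1, 2, 3])

# Arithmetic on an array is applied to each item, so this one
# expression computes the area of every circle.
circle_areas = np.pi * circle_radii ** 2

# sum adds up all the items of the array.
total_area = sum(circle_areas)

print(circle_areas)
print(total_area)
```

No loop is needed: the exponent and the multiplication by `np.pi` are both carried out item by item.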

In [None]:
volumes = ...
print('volumes are', volumes)

total_volume = ...
print('total is', total_volume)

In [None]:
grader.check("p2.5")

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 3. Analyze Tables via Old Faithful Geyser data (20 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Use built-in Table functions, Python functions, and numpy (np) functions to answer quantitative questions about a dataset
</font>

[Old Faithful](https://en.wikipedia.org/wiki/Old_Faithful) is a geyser in Yellowstone that erupts every 44 to 125 minutes. People are [often told that the geyser erupts every hour](http://yellowstone.net/geysers/old-faithful/), but in fact the waiting time between eruptions is more variable. Let's take a look.

#### Part 3.1 (5 pts)


The first two lines below load the data and assign `waiting_times` to an array of 272 consecutive waiting times between eruptions, taken from a classic 1938 dataset. Assign the names `shortest`, `longest`, and `average` so that the `print` statement is correct.

In [None]:
old_faithful = Table.read_table('old_faithful.csv')
waiting_times = old_faithful.column('waiting')

shortest = ...
longest = ...
average = ...

print("Old Faithful erupts every", shortest, "to", longest, "minutes and every", average, "minutes on average.")

In [None]:
grader.check("p3.1")

#### Part 3.2 (5 pts)


Assign `biggest_decrease` to the biggest decrease in waiting time between two consecutive eruptions. For example, the third eruption occurred after 74 minutes and the fourth after 62 minutes, so the decrease in waiting time was 74 - 62 = 12 minutes.

*Hint*: We want to return the absolute value of the biggest **decrease** (careful, this is *not* the biggest increase *or* decrease) between two consecutive eruptions.  You will want to read the documentation for [np.diff()](https://cs104williams.github.io/assets/python-library-ref.html#diff) to understand how it computes differences between adjacent items.
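A minimal sketch of what `np.diff` returns, using made-up values rather than the Old Faithful data:

```python
import numpy as np

# Hypothetical measurements.
values = np.array([3, 8, 6, 10])

# Each item of the result is the next value minus the current one:
# 8 - 3, 6 - 8, 10 - 6
changes = np.diff(values)

print(changes)
print(len(values), len(changes))
```

Note that a decrease shows up as a negative entry (here, 6 - 8 = -2), and that the result has one fewer item than the input.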

In [None]:
# np.diff() calculates the difference between subsequent values  
# in a NumPy array.
differences = np.diff(waiting_times) 
biggest_decrease = ...
biggest_decrease

In [None]:
grader.check("p3.2")

#### Part 3.3 (5 pts)


Suppose the surveyors started watching Old Faithful at the start of the first eruption. Assume that they watch until the end of the tenth eruption.  How many cumulative minutes will they spend waiting for eruptions?

*Hint:* One way to approach this problem is to use the `take` method on the table `faithful`. 

*Another Hint:* `first_nine_waiting_times` must be an array.

In [None]:
faithful = Table.read_table('old_faithful.csv')

faithful_first_nine_rows = ...
first_nine_waiting_times = ...
total_waiting_time_until_tenth = ...
total_waiting_time_until_tenth

In [None]:
grader.check("p3.3")

#### Part 3.4 (5 pts)


Let’s imagine your guess for the next waiting time was always just the length of the previous waiting time. If you always guessed the previous waiting time, how big would your error in guessing the waiting times be, on average?

For example, since the first four waiting times are 79, 54, 74, and 62, the average difference between your guess and the actual time for just the second, third, and fourth eruptions would be $\frac{|54-79|+ |74-54|+ |62-74|}{3} = 19$.  

*Note:* The notation $|54-79|$ indicates the absolute value of $54-79$.
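The arithmetic in the worked example above can be checked with NumPy's `np.abs` and `np.mean`. This sketch types the three gaps in by hand and only reproduces the example; the variable names are hypothetical.

```python
import numpy as np

# The three gaps from the worked example above, entered by hand.
gaps = np.array([54 - 79, 74 - 54, 62 - 74])

absolute_gaps = np.abs(gaps)         # 25, 20, 12
mean_gap = np.mean(absolute_gaps)    # (25 + 20 + 12) / 3

print(mean_gap)
```

Taking the absolute value first matters: without it, the negative and positive gaps would partly cancel when averaged.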

In [None]:
differences = np.diff(waiting_times)
average_error = ...
average_error

In [None]:
grader.check("p3.4")

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 4. Analyze Multiple Tables with Fruit Stand Data (30 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Use built-in table functions, Python functions, and numpy functions to answer quantitative questions about a dataset
- Combine data from multiple Tables
</font>

#### Part 4.1 (5 pts)


The file `inventory.csv` contains information about the inventory at a fruit stand.  Each row represents the contents of one box of fruit. Load it as a table named `inventory` using the `Table.read_table()` function.

In [None]:
inventory = ...
inventory

In [None]:
grader.check("p4.1")

#### Part 4.2 (5 pts)


Does each box at the fruit stand contain a different fruit? Set `all_different` to `True` if each box contains a different fruit or to `False` if multiple boxes contain the same fruit.

*Hint:* You don't have to write code to actually calculate the True/False value for `all_different`. Just visually inspect the `inventory` table and assign `all_different` to either `True` or `False` according to what you can see from the table.

In [None]:
all_different = ...
all_different

In [None]:
grader.check("p4.2")

#### Part 4.3 (5 pts)


The file `sales.csv` contains the number of fruit sold from each box in one day.  It has an extra column called "price per fruit (\$)" that's the price *per item of fruit* for fruit in that box.  The rows are in the same order as the `inventory` table.  Load these data into a table called `sales`.

In [None]:
sales = ...
sales

In [None]:
grader.check("p4.3")

#### Part 4.4 (5 pts)


What is the total count of fruit that the store sold on that day? 

In [None]:
total_fruits_sold = ...
total_fruits_sold

In [None]:
grader.check("p4.4")

#### Part 4.5 (5 pts)


What was the store's total revenue (the total price of all fruits sold) on that day?

*Hint:* If you're stuck, think first about how you would compute the total revenue from just the grape sales.
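One way to think about the hint is shown below on made-up numbers: multiply two arrays item by item, then add up the results. The names and values are hypothetical and do not come from `sales.csv`.

```python
import numpy as np

# Hypothetical numbers for three products (not from sales.csv).
units_sold = np.array([2, 5, 1])
unit_price = np.array([3.0, 0.5, 4.0])

# Multiplying two arrays of the same length pairs up their items:
# 2 * 3.0, 5 * 0.5, 1 * 4.0
revenue_per_product = units_sold * unit_price

combined_revenue = sum(revenue_per_product)
print(combined_revenue)
```

Each item of `revenue_per_product` is the revenue from a single product, which is the "just one fruit" calculation from the hint repeated for every row.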

In [None]:
total_revenue = ...
total_revenue

In [None]:
grader.check("p4.5")

#### Part 4.6 (5 pts)


Make a new table called `remaining_inventory`.  It should have the same rows and columns as `inventory`, except that the amount of fruit sold from each box should be subtracted from that box's **original** count, so that the "count" column is **updated to be** the amount of fruit remaining after that day's sales.

In [None]:
remaining_inventory = ...

remaining_inventory

In [None]:
grader.check("p4.6")

<hr class="m-0" style="border: 3px solid #500082;"/>

# You're Done!
Follow these steps to submit your work:
* Run the tests and verify that they pass as you expect. 
* Choose **Save Notebook** from the **File** menu.
* **Run the final cell** and click the link below to download the zip file. 

Once you have downloaded that file, go to [Gradescope](https://www.gradescope.com/) and submit the zip file to 
the corresponding assignment. For Lab N, the assignment will be called "Lab N Autograder".

Once you have submitted, your Gradescope assignment should show you passing all the tests you passed in your assignment notebook.


## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)