In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("pre02.ipynb")

<table style="width: 100%;">
<tr style="background-color: transparent;">
<td width="100px"><img src="https://cs104williams.github.io/assets/cs104-logo.png" width="90px" style="text-align: center"/></td>
<td>
  <p style="margin-bottom: 0px; text-align: left; font-size: 18pt;"><strong>CSCI 104: Data Science and Computing for All</strong><br>
                Williams College<br>
                Fall 2025</p>
</td>
</tr>
</table>


# Prelab 2: Tables

**Instructions**
- Before you begin, execute the cell at the TOP of the notebook to load the provided tests, as well as the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute these cells again.
- Be sure to consult your [Python Reference](https://cs104williams.github.io/assets/python-library-ref.html)!
- Complete this notebook by filling in the cells provided. 
- Please be sure to not re-assign variables throughout the notebook.  For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously.
- There are no hidden tests in prelabs.

<hr/>
<h2>Setup</h2>


In [None]:
# Run this cell to set up the notebook.
# These lines import the numpy, datascience, and cs104 libraries.

import numpy as np
from datascience import *
from cs104 import *
%matplotlib inline

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 1. Python Review (10 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Practice using mathematical functions that are native to Python
</font>

The two building blocks of Python code are *expressions* and *statements*.  An **expression** is a piece of code that

* is self-contained, meaning it would make sense to write it on a line by itself, and
* usually evaluates to a value.

Here are two expressions that both evaluate to 3:

    3
    5 - 2
    
One important type of expression is the **call expression**. A call expression begins with the name of a function and is followed by the argument(s) of that function in parentheses. The function returns some value, based on its arguments. Some important mathematical functions are listed below.

| Function | Description                                                   |
|----------|---------------------------------------------------------------|
| `abs`      | Returns the absolute value of its argument                    |
| `max`      | Returns the maximum of all its arguments                      |
| `min`      | Returns the minimum of all its arguments                      |
| `pow`      | Raises its first argument to the power of its second argument |
| `round`    | Rounds its argument to the nearest integer                     |

Here are two call expressions that both evaluate to 3:

    abs(2 - 5)
    max(round(2.8), min(pow(2, 10), -1 * pow(2, 10)))

The expression `2 - 5` and the two call expressions given above are examples of **compound expressions**, meaning that they are actually combinations of several smaller expressions.  `2 - 5` combines the expressions `2` and `5` by subtraction.  In this case, `2` and `5` are called **subexpressions** because they're expressions that are part of a larger expression.
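
As a small illustration (not part of the graded work), the sketch below calls each function from the table once and then evaluates the longer call expression above one subexpression at a time. The names `inner_min`, `rounded`, and `result` are made up for this example.

```python
# Each of these is a call expression: a function name followed by arguments in parentheses.
print(abs(-7))        # absolute value
print(max(3, 9, 4))   # largest argument
print(min(3, 9, 4))   # smallest argument
print(pow(2, 5))      # 2 raised to the 5th power
print(round(3.6))     # nearest integer

# Evaluate max(round(2.8), min(pow(2, 10), -1 * pow(2, 10))) from the inside out.
inner_min = min(pow(2, 10), -1 * pow(2, 10))   # min(1024, -1024)
rounded = round(2.8)                            # 3
result = max(rounded, inner_min)                # max(3, -1024)
print(inner_min, rounded, result)
```

Working from the innermost parentheses outward is usually the easiest way to read a compound expression.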

A **statement** is a whole line of code.  Some statements are just expressions.  The expressions listed above are examples.

Other statements *make something happen* rather than *having a value*. For example, an **assignment statement** assigns a value to a name. 

A good way to think about this is that we're **evaluating the right-hand side** of the equals sign and **assigning it to the left-hand side**. Here are some assignment statements:
    
    height = 1.3
    the_number_five = abs(-5)
    absolute_height_difference = abs(height - 1.688)

An important idea in programming is that large, interesting things can be built by combining many simple, uninteresting things.  The key to understanding a complicated piece of code is breaking it down into its simple components.

For example, a lot is going on in the last statement above, but it's really just a combination of a few things.  This picture describes what's going on.

<img src="statement.png" width="50%"/>
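
The same breakdown can be sketched in code. The intermediate name `difference` is introduced only for this illustration; the original statement does the same work in a single line.

```python
height = 1.3

# Step 1: evaluate the subexpression inside the call. The result is a small negative number.
difference = height - 1.688

# Step 2: call abs on that value, then assign the result to the name on the left-hand side.
absolute_height_difference = abs(difference)

print(difference)
print(absolute_height_difference)
```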

#### Part 1.1 (5 pts)


Hakeem is baking a cake. Recipe A calls for 225 grams of plain flour and Recipe B calls for 2.75 cups of flour. He wants to use the recipe that uses less flour. Hakeem converts both recipes to cups and then rounds to the nearest cup before he compares them.

In the following cell, assign
- `recipe_a_cups` to the number of cups (rounded to the nearest whole cup) used by recipe A
- `recipe_b_cups` to the number of cups (rounded to the nearest whole cup) used by recipe B
- `min_recipe_cups` to the number of whole cups used by the recipe with the *fewest* cups 
- `recipe_with_least_flour` to a string, either `"A"` or `"B"`, corresponding to the recipe with the fewest whole cups

Be sure to check your work by executing the test cell afterward.

*Hints:* 
- A cup of all-purpose flour typically weighs 120 grams
- Which functions (`abs`, `max`, `min`, `pow`, `round`) will you need to use? 

In [None]:
recipe_a_cups = ...
recipe_b_cups = ...
min_recipe_cups = ...
recipe_with_least_flour = ...

In [None]:
grader.check("p1.1")

#### Part 1.2 (5 pts)


In the next cell, assign the name `mystery` to the larger number among the following two numbers:

1. the **absolute value** of $2^{5}-2^{11}-2^1 + 1$ 
2. $5 \times 13 \times 31 + 8$.

Try to use just one statement (one line of code). Be sure to check your work by executing the test cell afterward.


In [None]:
mystery = ...
mystery

In [None]:
grader.check("p1.2")

We've asked you to use one line of code in the question above because it only involves mathematical operations and demonstrates a large, complex **statement**. 

However, more complicated programming questions will require more steps. It isn’t always a good idea to jam these steps into a single line because it can make the code harder to read and harder to debug.

Good programming practice involves splitting up your code into smaller steps and using appropriate names. You'll have plenty of practice in the rest of this course!
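
Here is a minimal sketch of that idea, using made-up temperature readings that are unrelated to the question above. Both versions compute the same value; the second gives each step a name, so you can inspect intermediate results when something looks wrong.

```python
# Hypothetical readings, chosen only for illustration.
baseline = 65
warmest = 72
coldest = 55

# One-line version: correct, but the reader has to unpack it mentally.
largest_gap_one_line = max(abs(warmest - baseline), abs(coldest - baseline))

# Step-by-step version: each intermediate value has a descriptive name.
warm_gap = abs(warmest - baseline)
cold_gap = abs(coldest - baseline)
largest_gap = max(warm_gap, cold_gap)

print(largest_gap_one_line, largest_gap)
```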

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 2. Table Manipulation with Farmers Market Data (30 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Practice using built-in table functions: `.show(), .num_columns, .num_rows, .select(), .drop(), .sort(), .where()`
</font>

The table `farmers_markets.csv` contains data on farmers' markets in the United States (data collected [by the USDA](https://apps.ams.usda.gov/FarmersMarketsExport/ExcelExport.aspx)). Each row represents one such market.

Run the next cell to load the `farmers_markets` table.

In [None]:
# Just run this cell

farmers_markets = Table.read_table('farmers_markets.csv')

Let's examine our table to see what data it contains.

#### Part 2.1: `.show()` (5 pts)


Use the method `show` to display the first 5 rows of `farmers_markets`. 

*Note:* The terms "method" and "function" are technically not the same thing, but for the purposes of this course, we will use them interchangeably.

**Hint:** `tbl.show(3)` will show the first 3 rows of `tbl`. Additionally, make sure not to call `.show()` without an argument, as this may crash your kernel!

In [None]:
...

In [None]:
grader.check("p2.1")

Notice that some of the values in this table are missing, as denoted by `nan`. This means either that the value is not available (e.g., if we don’t know the market’s street address) or not applicable (e.g., if the market doesn’t have a street address). You'll also notice that the table has a large number of columns in it.

Before continuing, let's look up the `show` method in our online [Python Reference](http://cs104williams.github.io/assets/python-library-ref.html).  Go to that page and either scroll down the list of "Table Functions and Methods" to the entry for `show` or jump to it directly with the popup menu in the top right corner.  Familiarize yourself with the content of that row in the table, which describes how to use the method, the parameters it may take, and the result it computes.  Click on the row to see several simple examples of using the method.  Consult this documentation often!

#### Part 2.2: `.num_columns` and `.num_rows` (5 pts)


The table **property** `num_columns` returns the number of columns in a table. (A **property** is just a method that doesn't need to be called by adding parentheses.) 

Example call: `tbl.num_columns`

Look up `num_columns` in our online [Python Reference](http://cs104williams.github.io/assets/python-library-ref.html), and use `num_columns` to find the number of columns in our farmers' markets dataset.  Assign the number of columns to `num_farmers_markets_columns`.


In [None]:
num_farmers_markets_columns = ...
print("The table has", num_farmers_markets_columns, "columns in it!")

In [None]:
grader.check("p2.2")

Similarly, the property `num_rows` tells you how many rows are in a table.

In [None]:
# Just run this cell

num_farmers_markets_rows = farmers_markets.num_rows
print("The table has", num_farmers_markets_rows, "rows in it!")

#### Part 2.3: `.select()` (5 pts)


Most of the columns are about particular products -- whether the market sells tofu, pet food, etc.  If we're not interested in that information, it just makes the table difficult to read.  This comes up more than you might think, because people who collect and publish data may not know ahead of time what people will want to do with it.

In such situations, we can use the table method `select` to choose only the columns that we want in a particular table. It takes any number of arguments. Each should be the name of a column in the table. It returns a new table with only those columns in it. The columns are in the order *in which they were listed as arguments*.

For example, the value of `farmers_markets.select("MarketName", "State")` is a table with only the name and the state of each farmers' market in `farmers_markets`.

Use `select` to create a table with only the name, city, state, latitude (`y`), and longitude (`x`) of each market.  Call that new table `farmers_markets_locations`.

*Hint:* Make sure to be exact when using column names with `select`; double-check capitalization!

In [None]:
farmers_markets_locations = ...
farmers_markets_locations

In [None]:
grader.check("p2.3")

#### Part 2.4:  `.drop()` (5 pts)


`drop` serves the same purpose as `select`, but it takes away the columns that you provide rather than the ones that you don't provide. Like `select`, `drop` returns a new table.

Suppose you just didn't want the `FMID` and `updateTime` columns in `farmers_markets`.  Create a table that's a copy of `farmers_markets` but doesn't include those columns.  Call that table `farmers_markets_without_fmid`.


In [None]:
farmers_markets_without_fmid = ...
farmers_markets_without_fmid

In [None]:
grader.check("p2.4")

#### Part 2.5: `.sort()` (5 pts)


Now, suppose we want to answer some questions about farmers' markets in the US. For example, which market(s) have the greatest longitude (given by the `x` column)? To answer this, we'll sort `farmers_markets_locations` by longitude.

In [None]:
farmers_markets_locations.sort('x')

Oops, that didn't answer our question because we sorted from smallest to greatest longitude. To look at the greatest longitudes, we'll have to sort in reverse order.

In [None]:
farmers_markets_locations.sort('x', descending=True)

The `descending=True` part is called a *named argument*. Its default value is `False`, which sorts from smallest to greatest; passing `descending=True` explicitly tells `sort` to sort in descending order instead.


Some details about sort:

1. The first argument to `sort` is the name of a column to sort by.
2. If the column has text in it, `sort` will sort alphabetically; if the column has numbers, it will sort numerically.
3. The value of `farmers_markets_locations.sort("x")` is a *copy* of `farmers_markets_locations`; the `farmers_markets_locations` table doesn't get modified. For example, if we called `farmers_markets_locations.sort("x")`, then running `farmers_markets_locations` by itself would still return the unsorted table.
4. Rows always stick together when a table is sorted.  It wouldn't make sense to sort just one column and leave the other columns alone.  For example, in this case, if we sorted just the `x` column, the farmers' markets would all end up with the wrong longitudes.

Create a version of `farmers_markets_locations` that's sorted by **latitude (`y`)**, with the greatest latitudes first.  Call it `farmers_markets_locations_by_latitude`.


In [None]:
farmers_markets_locations_by_latitude = ...
farmers_markets_locations_by_latitude

In [None]:
grader.check("p2.5")

#### Part 2.6: `.where()` (5 pts)


Now let's say we want a table of all farmers' markets in Massachusetts. Sorting won't help us much here because Massachusetts is close to the middle of the dataset.  Instead, we use the table method `where`.

In [None]:
ma_farmers_markets = farmers_markets_locations.where('State', are.equal_to('Massachusetts'))
ma_farmers_markets

Let's dive into our use of `where` in more detail.  `where` takes 2 arguments:

1. The name of a column.  `where` finds rows where that column's values meet some criterion.
2. A **predicate** that describes the criterion that the column needs to meet. The predicate in the example above called the function `are.equal_to` with the value we wanted, 'Massachusetts'.  We'll see other predicates soon.

The `where` method returns a new table that's a copy of the original table, but **with only the rows that meet the given predicate**.

Use `ma_farmers_markets` to create a table called `williamstown_markets` containing farmers' markets in Williamstown.

In [None]:
williamstown_markets = ...
williamstown_markets

In [None]:
grader.check("p2.6")

So far we've only used `where` with a predicate that requires the values in a column to be *exactly* equal to a certain value. However, there are many other predicates. See our [Python Reference](http://cs104williams.github.io/assets/python-library-ref.html#where) for a complete list.

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 3. Creating Tables (10 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Create tables from arrays using `Table().with_columns()`
- Create tables from .csv files using `Table.read_table()`
</font>

So far, we've looked at and manipulated **tables**. But let's dig into what tables are and how we can create them. 

An **array** is useful for describing a single attribute of each element in a collection. For example, let's say our collection is all US States. Then an array could describe the land area of each state. 

Tables extend this idea by containing multiple columns stored and represented as arrays, each one describing a different attribute for every element of a collection. In this way, tables allow us to not only store data about many entities but to also contain several kinds of data about each entity. 

For example, in the cell below we have two arrays. The first one, `population_amounts`, contains the world population in each year (estimated by the US Census Bureau). The second array, `years`, contains the years themselves. These elements are in order, so the year and the world population for that year have the same index in their corresponding arrays.

In [None]:
# Just run this cell
population_amounts = Table.read_table("world_population.csv").column("Population")
years = np.arange(1950, 2015+1)
print("Population column:", population_amounts)
print("Years column:", years)

Suppose we want to answer this question:

> In which year did the world's population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a *`Table`*, which is a two-dimensional dataset with both rows and columns. 

The expression below:

- creates an empty table using the expression `Table()`,
- adds two columns by calling `with_columns` with four arguments,
- assigns the result to the name `population`, and finally
- evaluates `population` so that we can see the table.

The strings `"Year"` and `"Population"` are column labels that we have chosen. The names `population_amounts` and `years` were assigned above to two arrays of the **same length**. The function `with_columns` (you can find the documentation [here](http://cs104williams.github.io/assets/python-library-ref.html)) takes in alternating strings (to represent column labels) and arrays (representing the data in those columns). The strings and arrays are separated by commas.

In [None]:
population = Table().with_columns(
    "Population", population_amounts,
    "Year", years
)
population

Now the data is combined into a single table! It's much easier to parse this data. If you need to know what the population was in 1959, for example, you can tell from a single glance.

#### Part 3.1 (5 pts)


In the cell below, we've created two arrays. Following the steps above, assign `top_10_movies` to a table that has two columns called "Name" and "Rating", which hold `top_10_movie_names` and `top_10_movie_ratings` respectively.



In [None]:
top_10_movie_names = make_array(
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Il buono, il brutto, il cattivo (1966)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)')
top_10_movie_ratings = make_array(9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8)

top_10_movies = ...

# We've put this next line here 
# so your table will get printed out 
# when you run this cell.
top_10_movies

In [None]:
grader.check("p3.1")

#### Part 3.2 (5 pts)


In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we load them in from an external source, like a data file. There are many formats for data files, but CSV ("comma-separated values") is the most common.

`Table.read_table(...)` takes one argument (a path to a data file in **string** format) and returns a table.  

The file `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb.  Load it as a table called `imdb`.

In [None]:
imdb = ...
imdb

In [None]:
grader.check("p3.2")

Where did `imdb.csv` come from? Take a look at this lab's folder (in the left sidebar). You should see a file called `imdb.csv`.

Double-click to open up the `imdb.csv` file in that folder and look at the format. What do you notice? Jupyter displays the contents of the file as a basic table, but it is really just a specially-formatted text file.  In particular, the `.csv` filename ending says that this file is in the [CSV (comma-separated value) format](http://edoceo.com/utilitas/csv-file-format).

<hr class="m-0" style="border: 3px solid #500082;"/>

# You're Done!
Follow these steps to submit your work:
* Run the tests and verify that they pass as you expect. 
* Choose **Save Notebook** from the **File** menu.
* **Run the final cell** and click the link below to download the zip file. 

Once you have downloaded that file, go to [Gradescope](https://www.gradescope.com/) and submit the zip file to 
the corresponding assignment. For Prelab N, the assignment will be called "Prelab N Autograder".

Once you have submitted, your Gradescope assignment should show you passing all the tests you passed in your assignment notebook.


## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)