In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab03.ipynb")

<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Lab 03: Tables

---

## References

* [Sections 6.0 - 6.4 of the Textbook](https://ccsf-math-108.github.io/textbook/chapters/06/Tables.html)
* [datascience Documentation](https://datascience.readthedocs.io/)
* [Markdown Cheat Sheet](https://www.markdownguide.org/cheat-sheet/)

---

## Lab Assignment Reminders

- 🚨 Make sure to run the code cell at the top of this notebook that starts with `# Initialize Otter` to load the auto-grader.
- Your tasks are categorized as auto-graded (📍) and manually graded (📍🔎):
    - **For all auto-graded tasks:**
        - Replace the `...` in the provided code cell with your own code.
        - Run the `grader.check` code cell to execute tests on your code.
        - There are no hidden auto-grader tests in the lab assignments. This means if you pass the tests, you can assume you've completed the task successfully.
    - **For all manually graded tasks:**
        - You may need to provide your own response to the provided prompt. Replace the template text "_Type your answer here, replacing this text._" with your own words.
        - You might need to produce a graphic or another output using code. Replace the `...` in the code cell to generate the image, table, etc.
        - In either case, check your response with a classmate, a tutor, or the instructor before moving on.
- In this assignment and all future ones, please **do not re-assign variables** anywhere in the notebook! _For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you may fail tests that you thought you were passing previously!_
- You may [submit](#Submit-Your-Assignment-to-Canvas) this assignment as many times as you want before the deadline. Your instructor will score the last version you submit once the deadline has passed.
- **Collaborating on labs is encouraged!** You should rarely remain stuck for more than a few minutes on questions in labs, so ask an instructor or classmate for help. (Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it.) However, please don't just share answers.

---

## Configure the Notebook

Run the following cell to configure this Notebook.

In [None]:
import numpy as np
from datascience import *

---

## Loading Data

---

### CSV Files

The CSV file `farmers_markets.csv` contains data on farmers' markets in the United States. (The data was collected from [the USDA's website](https://apps.ams.usda.gov/FarmersMarketsExport/ExcelExport.aspx).) CSV (comma-separated values) refers to how the information in the file is organized. If you run the following code cell, you'll see the first 3 lines of the file and probably notice **how challenging it is to read**.

In [None]:
!head -n 3 farmers_markets.csv

To better display that information and to provide you with a collection of tools to work with the data, we will guide you in storing that CSV into a `Table` format.
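To make the "comma-separated" idea concrete, here is a minimal sketch that uses Python's standard `csv` module on a short, made-up piece of text. The text in `raw_text` is hypothetical and is not taken from `farmers_markets.csv`; it only illustrates that the first line holds the column names and each later line holds one row.

```python
import csv
import io

# Hypothetical CSV text, invented for illustration (not from farmers_markets.csv).
raw_text = "Name,City,Vendors\nSunrise Market,Oakland,12\nHarbor Market,Eureka,7\n"

# csv.reader splits each line into a list of values at the commas.
reader = csv.reader(io.StringIO(raw_text))
header = next(reader)   # the first line holds the column names
rows = list(reader)     # every remaining line is one row of data

print(header)
for row in rows:
    print(row)
```

Notice that every value is read in as text. A `Table` adds structure on top of this raw layout.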

---

### `read_table`

The `read_table` function from the `datascience` library helps you load the contents from a CSV file and store the information within a `Table`.

Run the next cell to load the `farmers_markets` table.

In [None]:
farmers_markets = Table.read_table('farmers_markets.csv')

After running that code cell, `farmers_markets` represents the table of information. In this lab and some future assignments, you are going to focus on working with these tables. If you run the following code cell, you'll see the contents of the table. 

* By default, Jupyter displays the first 10 rows of a table.
* Notice that it shows `... (1671 rows omitted)` below the displayed table.

In [None]:
farmers_markets

Hopefully, that is a little easier to read than the raw contents of the CSV file! You didn't do all of that just to visualize the information, though: loading the data into a `Table` also unlocked a collection of tools for working with it.

---

## Attributes and Methods

---

When you create something in Python, such as a `Table` (or any object), you gain access to a collection of properties (attributes) and functions (methods) associated with that object's data type. These properties and methods provide ways to interact with, manipulate, and retrieve information from the object.
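The difference between an attribute and a method shows up in any Python object, not only in a `Table`. As a small sketch, the following uses a `date` object from Python's standard library (chosen here only as a familiar stand-in): `year` is an attribute, so it is read without parentheses, while `isoformat` is a method, so it is called with parentheses.

```python
from datetime import date

# A date object, used here only as a familiar stand-in for "some object".
lab_day = date(2025, 1, 15)

year_value = lab_day.year        # attribute: stored information, no parentheses
as_text = lab_day.isoformat()    # method: a function attached to the object, called with ()

print(year_value)
print(as_text)
```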

*There are often many attributes and methods available, so it's common to use documentation and tools like artificial intelligence to explore and understand them. In this course, you will focus on a specific subset of these tools. We encourage you to use our reference materials and commit some of this syntax to memory rather than relying on artificial intelligence for learning.*

---

## Some `Table` Attributes and Methods

---

### `show`

Earlier, you previewed the `farmers_markets` table by using its name, but how can you look at a specific number of rows or the entire table? The `show` table method in the `datascience` library *displays* a subset of rows from a `Table`. This method is handy for quickly inspecting the data.

* By default, `tbl.show()` displays all of the rows from the table `tbl`.
* You can specify the number of rows to display by passing a number as an argument.

For example, the following code shows the first 10 rows of the `farmers_markets` table.

In [None]:
farmers_markets.show(10)

This method is for visual inspection only and does not return a `Table`. Run the following code to see that the 10 rows are displayed, but the value returned by `show` has the data type `NoneType`.

In [None]:
type(farmers_markets.show(10))

In other words, running `farmers_markets` and `farmers_markets.show(10)` **do not do the same thing!**
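If the idea of "displays something but returns nothing" feels strange, Python's built-in `print` behaves the same way. The sketch below is only an analogy and does not use the `datascience` library: the text appears on the screen, yet the value handed back is `None`.

```python
# print displays its argument, but the value it returns is None.
result = print("displayed, not returned")

result_type = type(result).__name__
print(result_type)
```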

---

#### Task 01 📍🔎

Use the method `show` to display the first 5 rows of `farmers_markets`.

Make sure to check your visualization with a classmate, a tutor, or the instructor before moving on since there is no auto-grader for this lab task.

**Important Note:** Be careful not to call `.show()` without an argument. There is a lot of information in the farmers' markets table, and showing it all will crash your kernel!

<!-- BEGIN QUESTION -->



In [None]:
...

<!-- END QUESTION -->

Notice that some of the values in this table are missing, as denoted by `nan`. This means either that the value is not available (e.g. if we don’t know the market’s street address) or not applicable (e.g. if the market doesn’t have a street address). You'll also notice that the table has a large number of columns in it!

---

### `num_columns`

The table attribute `num_columns` returns the number of columns in a table. Running `tbl.num_columns` would provide you with the number of columns in the table `tbl`. Notice that `num_columns` is not a function, so you don't use `()` at the end. Think of `num_columns` as a variable name with specific stored information about the table.

---

#### Task 02 📍

Use `num_columns` to find the number of columns in our farmers' markets dataset.

Assign the number of columns to `num_farmers_markets_columns`.

In [None]:
num_farmers_markets_columns = ...
print("The table has", num_farmers_markets_columns, "columns in it!")

In [None]:
grader.check("task_02")

---

### `num_rows`

Similarly, the attribute `num_rows` tells you how many rows are in a table. Run the following cell to see how that table property can be accessed.

In [None]:
num_farmers_markets_rows = farmers_markets.num_rows
print("The table has", num_farmers_markets_rows, "rows in it!")

---

### `select`

Most of the columns in `farmers_markets` are about particular products -- whether the market sells tofu, pet food, etc.  If we're not interested in that information, it just makes the table difficult to read and potentially slows down the computer if we work with the table. This comes up more than you might think, because people who collect and publish data may not know ahead of time what people will want to do with it.

In such situations, we can use the table method `select` to choose only the columns that we want in a particular table. It takes any number of arguments. Each should be the name of a column in the table. It returns a new table with only those columns in it. The columns are in the order *in which they were listed as arguments*.

For example, the value of `farmers_markets.select("MarketName", "State")` is a table with only the name and the state of each farmers' market in `farmers_markets`.

---

#### Task 03 📍

Use `select` to create a table with only the name, city, county, state, longitude (`x`), and latitude (`y`) of each market, in that order. Assign that new table to the name `farmers_markets_locations`. 

**Note:** We didn't create the column names (or their format); they come directly from the United States Department of Agriculture data resource. Make sure to be exact when using column names with `select`. Double-check capitalization! Also, in this task, the order of the columns matters.

In [None]:
farmers_markets_locations = ...
farmers_markets_locations

In [None]:
grader.check("task_03")

---

### `drop`

`drop` serves a similar purpose to `select`, but it removes the columns that you name and keeps all the others. Like `select`, `drop` returns a new table.

For example, the following code would create a copy of the `farmers_markets` table without the `FMID` and `updateTime` columns.

In [None]:
farmers_markets.drop("FMID", "updateTime")

---

### `where`

Now let's say we want a table of all farmers' markets in California. We can use the table method `where` to do this. 

Run the following cell to filter the table to include only California farmers' markets.

In [None]:
california_farmers_markets = farmers_markets_locations.where('State', are.equal_to('California'))
california_farmers_markets

Ignore the syntax for the moment.  Instead, try to read that line like this:

> Assign the name **`california_farmers_markets`** to a table whose rows are the rows in the **`farmers_markets_locations`** table **`where`** the **`'State'`**s **`are` `equal` `to` `California`**.

Let's dive into the details a bit more.  `where` takes 2 arguments:

1. The name of a column.  `where` finds rows where that column's values meet some criterion.
2. A predicate that describes the criterion that the column needs to meet.

The predicate in the example above called the function `are.equal_to` with the value we wanted, `'California'`.  We'll see other predicates soon.

`where` returns a table that's a copy of the original table, but **with only the rows that meet the given predicate**.

---

#### Task 04 📍

Use `california_farmers_markets` to create a table called `sf_markets` containing farmers' markets in San Francisco, California. 

**Note:** These aren't all of the markets in the city, but they are the markets listed in USDA's data set.

In [None]:
sf_markets = ...
sf_markets

In [None]:
grader.check("task_04")

So far, you've only used `where` with a predicate that requires the values in a column to be *exactly* equal to a certain value. However, there are many other predicates. Here are a few:

|Predicate|Example|Result|
|-|-|-|
|`are.equal_to`|`are.equal_to(50)`|Find rows with values equal to 50|
|`are.not_equal_to`|`are.not_equal_to(50)`|Find rows with values not equal to 50|
|`are.above`|`are.above(50)`|Find rows with values above (and not equal to) 50|
|`are.above_or_equal_to`|`are.above_or_equal_to(50)`|Find rows with values above 50 or equal to 50|
|`are.below`|`are.below(50)`|Find rows with values below 50|
|`are.between`|`are.between(2, 10)`|Find rows with values above or equal to 2 and below 10|
|`are.containing`|`are.containing('Data Science')`|Find rows with values that contain the substring `'Data Science'`|

Next, you are going to practice using some of those predicates to filter the data in the table. For additional predicates, see the Code Reference [Table Filtering Predicates section](https://ccsf-math-108.github.io/materials-sp25/resources/code-reference.html#table-filtering-predicates).

---

#### Task 05 📍

There are many cities in California with "San" (which means "Saint" in Spanish) in their name due to Spanish colonization and the establishment of Catholic missions during the 18th and early 19th centuries.

Assign `san_farmers_markets` to a table containing all the information from `california_farmers_markets` where the city contains the sub-name `San `.

**Note**: Make sure to include the space after "San", so you don't include places like Santa Clara.

In [None]:
san_farmers_markets = ...
san_farmers_markets

In [None]:
grader.check("task_05")

---

#### Task 06 📍

<a href="https://commons.wikimedia.org/wiki/File:Location_Map_San_Francisco_Bay_Area.png"><img src="./Bay_Area.png" alt="A map of the San Francisco Bay Area" width=400px></a>

Wikipedia **defines** the San Francisco Bay Area by the following border coordinates:

* 38.2033 - Latitude at top edge of map, in decimal degrees
* 37.1897 - Latitude at bottom edge of map, in decimal degrees
* -122.6445 - Longitude at left edge of map, in decimal degrees
* -121.5871 - Longitude at right edge of map, in decimal degrees

Assign `bay_area_farmers_markets` to a table that only contains the information from the `farmers_markets` table within the described Bay Area coordinates. 

**Notes**: 
* Remember that you can chain together functions like `where`.
* It doesn't matter if you include the coordinate boundaries in this task.
* **Defining** what the Bay Area is might be a divisive topic! 🫣

In [None]:
top = ...
bottom = ...
left = ...
right = ...
bay_area_farmers_markets = ...
bay_area_farmers_markets

In [None]:
grader.check("task_06")

---

### `take`

You might recall from earlier in this notebook that `tbl.show(5)` displays the first 5 rows of the table `tbl`, but it doesn't output a table. So, what should you use if you want to actually make a table with only the first 5 rows of some other table?

The `take` table method can help you do this. If you run `tbl.take(5)`, then you won't end up with the first 5 rows of `tbl`. Instead, you'll produce a table with only the 6th row of `tbl` because `take` treats `5` as the row index 5 (aka the 6th row).

Run the following code cell to see this.

In [None]:
farmers_markets.take(5)

In order to get the first 5 rows, you need to provide an array of row indices as an argument for `take`. Specifically, you need the row indices 0, 1, 2, 3, 4. Thankfully, you learned about `np.arange` previously as a way to generate a sequence of numbers like this (i.e. `np.arange(5)`).

#### Task 07 📍

Assign `first_5_markets` to a table containing the information from the first 5 rows of `farmers_markets`.

In [None]:
first_5_markets = ...
first_5_markets

In [None]:
grader.check("task_07")

---

### `group`

The `farmers_markets` table is initially set up to have one market per row, but what if you want to analyze the markets from the State perspective? For example, it could be helpful to count how many markets there are in each state.

The `group` table method in the `datascience` library is a powerful tool for summarizing and aggregating data based on the unique values within one or more columns. It works by grouping rows in a table based on the distinct values in one or more specified columns.

Run the following cell, which takes `farmers_markets_locations` and creates a new table that groups the farmers' markets by `'State'` and lists the number of farmers' markets in each state. We will go into greater depth with the `group` method later in the course. Notice that the column of counts automatically gets the label `count`.

In [None]:
farmers_markets_locations.group('State')

#### Task 08 📍

For this task, use `group` to assign `markets_by_county` to a table containing a row for each county in `california_farmers_markets` with a column (`'County'`) for the name of the county and a column (`'count'`) showing the count of markets within that county.

In [None]:
markets_by_county = ...
markets_by_county

In [None]:
grader.check("task_08")

---

### `sort`

Notice that `farmers_markets_locations.group('State')` sorted the information such that the states are in alphabetical order. If you want to re-organize the table to show the states with the most markets at the top, then you'd want to use the `sort` table method.

You need to provide `sort` with the column name (or index) you want to sort by and the information will be sorted in ascending order by default.

Run the following cell to find the states in the data set with the fewest farmers' markets.

In [None]:
farmers_markets_locations.group('State').sort('count')

If you want the largest counts at the top of the table, you'd need to sort in descending order. You can do this by adjusting the `descending` argument to have a value of `True`. Specifically, you could type `.sort('count', descending=True)`.

---

#### Task 09 📍

Assign `sorted_markets_by_county` to a table containing the information in `markets_by_county` sorted such that the counts are in descending order and the counties with the largest counts are at the top of the table.

In [None]:
sorted_markets_by_county = ...
sorted_markets_by_county

In [None]:
grader.check("task_09")

---

## Combining Commands and Using Documentation

There are many more table operations. It is challenging to memorize them all and you do not need to do that! Instead, you can reference documentation such as the [`datascience` documentation page on `Tables`](https://datascience.readthedocs.io/en/master/tables.html) to get a summary of all the operations for tables.

#### Task 10 📍

Using a variety of tools you've learned about in this assignment and the [`relabeled` command](https://datascience.readthedocs.io/en/master/_autosummary/datascience.tables.Table.relabeled.html#datascience.tables.Table.relabeled) from the documentation, assign `top_10` to a table with two columns, `'City'` and `'Number of Markets'`, which contains the 10 cities with the most farmers' markets according to the `farmers_markets` data set.

**Note**:
This task involves several steps. We recommend breaking it up into smaller steps and using intermediate variable names to store your progress.

In [None]:
top_10 = ...
top_10

In [None]:
grader.check("task_10")

Great work so far! There is a lot to learn about `Tables`. We will guide you through some of it directly; for the rest, you will need to look through the documentation or ask for help.

---

## Submit Your Assignment to Canvas

Follow these steps to submit your lab assignment:

1. **Check the Assignment Completion Requirements:** This assignment is scored as Complete or Incomplete. Make sure to check with your instructor about their requirements for a Complete score. 
2. **Run the Auto-Grader:** Ensure you have executed the code cell containing the command `grader.check_all()` to run all tests for auto-graded tasks marked with 📍. This command will execute all auto-grader tests sequentially.
3. **Complete Manually Graded Tasks:** Verify that you have responded to all the manually graded tasks marked with 📍🔎.
4. **Save Your Work:** In the notebook's Toolbar, go to `File -> Save Notebook` to save your work and create a checkpoint.
5. **Download the Notebook:** In the notebook's Toolbar, go to `File -> Download HTML` to download the HTML version (`.html`) of this notebook.
6. **Upload to Canvas:** On the Canvas Assignment page, click "Start Assignment" or "New Attempt" to upload the downloaded `.html` file.

---

## Attribution

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a> and derived from the <a href="https://www.data8.org/">Data 8: The Foundations of Data Science</a> offered by the University of California, Berkeley.

<img src="./by-nc-sa.png" width=100px>

---

To double-check your work, run the cell below to rerun all of the autograder tests.

In [None]:
grader.check_all()