In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab03.ipynb")

<img src="data6.png" style="width: 15%; float: right; padding: 1%; margin-right: 2%;"/>

# Lab 3 – Tables and Data Manipulation

## Data 6, Fall 2024

In this lab, we will be talking all about *Tables*. We use tables to store all sorts of data from sports statistics to population information. If there's data you have ever been curious about, it is very likely that the Internet has a table somewhere with that data!

Tables are integral to the foundation of Data Science, and we will go over how to **query** a table. **Querying** a table is, simply put, requesting information about the table. Some examples of common queries (in English, not code):

- How many data points are there?
- Which data points have a specific characteristic?
- What is the attribute of a specific data point?
- And many more!

There are so many ways we can use tables to get information we need, and there are several existing libraries in Python that we can use to do this! In this course, we will be using the `datascience` library. This is the standard library used both in Data 6 and Data 8 at UC Berkeley. If you take Data Science classes beyond those two, you'll learn more!

In [3]:
# Run this cell to load all required Python libraries 
import numpy as np
from datascience import *

import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")

import warnings
warnings.simplefilter('ignore')

## Table Creation

Let's take a look at a table in action. Python does not have any tables by default, so we can either *create a new table from scratch* or *import a table from a file*. First, let's see how we can make our own table from scratch.

We start out with an empty `Table` -- this is the same idea as having an empty array or string. Note that `Table` is capitalized and there is nothing in the parentheses.

In [5]:
# Run this cell to load an empty table
our_table = Table()
our_table

## Adding Data: `with_columns`

Now, let's put some data in our table! To do so, we use the `with_columns` method. For each column we add, this method takes **two arguments**:
1. The name of the column as a string
2. An array of values to put in the column

An example call to `with_columns` looks like: `my_tbl.with_columns("My New Column", my_array)`, where `my_tbl` is our table that we would like to add to and `my_array` is a previously-defined array.

Run the cell below to see how we can add multiple columns into `our_table`.

In [13]:
our_table = our_table.with_columns("Department", make_array("Data Science", "Economics", "Political Science", "Sociology"),
                                   "Course Number", make_array(6, 1, 2, 121))
our_table

Department,Course Number
Data Science,6
Economics,1
Political Science,2
Sociology,121


We need to make sure that the columns we add to the table have the same number of rows (the length of the array we pass in) as the table. Otherwise, we'll get an error.

Watch what happens if we try to add a new column that doesn't have enough data (you'll see an error!).

In [12]:
# Just run this cell
our_table.with_columns("Too Few Rows", make_array(1, 2, 4))

ValueError: Column length mismatch. New column does not have the same number of rows as table.
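
The row-count rule can be pictured with plain Python lists. This is only a sketch of the idea, not how the `datascience` library performs the check; `can_add_column` is a hypothetical helper and the numbers are made up.

```python
# Hypothetical helper that mimics the row-count rule with plain lists.
def can_add_column(existing_column, new_values):
    """Return True when the new column has one value per existing row."""
    return len(existing_column) == len(new_values)

departments = ["Data Science", "Economics", "Political Science", "Sociology"]

too_few = can_add_column(departments, [1, 2, 4])            # 3 values for 4 rows
just_right = can_add_column(departments, [10, 20, 30, 40])  # 4 values for 4 rows

print(too_few)
print(just_right)
```

A column with three values cannot line up with four rows, which is why the cell above raises an error.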

**Question 1.1**: Add a new column to `our_table` called `"Number of Students"` that contains the number of students in each department. Jedi tells you that the *Data Science* department has 240 students, the *Economics* department has 905 students, the *Political Science* department has 209 students, and the *Sociology* department has 63 students.

Assign this new table to the variable `our_table_new_column`.


In [14]:
# Our table has 4 rows, so our new column needs an array with 4 items, 1 for each row
students_array = make_array(240, 905, 209, 63)
our_table_new_column = our_table.with_columns("Number of Students", students_array)
our_table_new_column

Department,Course Number,Number of Students
Data Science,6,240
Economics,1,905
Political Science,2,209
Sociology,121,63


In [15]:
grader.check("q1_1")

In [16]:
# This is our final table!
# You may use this cell to explore the table and see what you can do with it so far!
our_table_new_column

Department,Course Number,Number of Students
Data Science,6,240
Economics,1,905
Political Science,2,209
Sociology,121,63


## Table attributes: `num_rows` and `num_columns`

We can ask for all sorts of information about the table itself:

In [17]:
our_table_new_column.num_rows

4

In [18]:
our_table_new_column.num_columns

3

## Accessing Data: `column`

We can also ask about the data in the table using the `column` method. As mentioned in lecture, we can pass in a `label` or an `index` to this method. 

*Note*: Recall that we index into the columns of a table using **zero-indexing** -- `0` corresponds to the first column, `1` corresponds to the second, etc.

In [19]:
# Converts a column in the table to an array
our_table_new_column.column("Department")

array(['Data Science', 'Economics', 'Political Science', 'Sociology'],
      dtype='<U17')

In [22]:
# Same method, but uses a column index instead of a label (index 1 is "Course Number")
our_table_new_column.column(1)


array([  6,   1,   2, 121])
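
To see which label each index refers to, here is a plain-Python sketch. The `labels` tuple is typed out by hand to match the three columns of `our_table_new_column`; it is an illustration, not a call into the `datascience` library.

```python
# Column labels of our_table_new_column, typed out by hand for illustration.
labels = ("Department", "Course Number", "Number of Students")

# enumerate pairs each label with its zero-based position.
for index, label in enumerate(labels):
    print(index, label)

second_label = labels[1]  # index 1 is the second column
```

Index `1` lands on `"Course Number"`, which matches the array of course numbers shown above.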

**Question 1.2**: Find the average number of students per department by first accessing the `"Number of Students"` column as an array and then taking the average. Assign the average to `avg_num_students`.

*Note*: You may use any `np` functions here.


In [23]:
avg_num_students = np.mean(our_table_new_column.column("Number of Students"))
avg_num_students

354.25

In [24]:
grader.check("q1_2")

## Loading a Table

Although creating our own tables by hand can be useful, more often than not the data we want to work with is far too large to be able to type out by hand. More commonly, we load datasets in from other sources using the `Table.read_table()` method. We can pass in a *file path* to this method and it will load that data into a table we can use in Python!
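
To get a feel for what a CSV file holds, here is a minimal sketch that uses Python's built-in `csv` module instead of `Table.read_table()`. The two-row "file" is a made-up string, not BRFSS data, and the variable names are illustrative.

```python
import csv
import io

# Made-up stand-in for the contents of a CSV file.
csv_text = "State,Sleep Time\nNebraska,6\nGeorgia,7\n"

reader = csv.reader(io.StringIO(csv_text))
labels = next(reader)   # the first line holds the column labels
rows = list(reader)     # the remaining lines are the data rows

print(labels)
print(len(rows))
```

The first line of the file becomes the column labels and every later line becomes one row, which is the same shape a table has once the data is loaded.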

### Background on the Data

The dataset that we'll use in this lab comes from the Behavioral Risk Factor Surveillance System (BRFSS), a health survey fielded by the Centers for Disease Control and Prevention (CDC). From the [BRFSS website](https://www.cdc.gov/brfss/index.html):
>The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services.

>By collecting behavioral health risk data at the state and local level, BRFSS has become a powerful tool for targeting and building health promotion activities. 

The dataset that you will investigate is a **subset of the 2022 BRFSS Survey**. We've taken all the data points corresponding to fully-completed surveys and, in our opinion, the most interesting columns. Since the entire data set is so large, we've randomly sampled 100,000 rows from the original data. While we've wrangled and cleaned the data set you'll use in your investigation, you're welcome to investigate the original source; you can do so via the [Survey Data section](https://www.cdc.gov/brfss/data_documentation/index.htm) of the BRFSS site.

## Seeing a table: `show()`

The `show()` method **displays** the first `n` rows of a table, where `n` is the number we pass in. Like `print()`, it does not return a value; it only displays the rows.
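
To see what "does not return a value" means, here is a small sketch that uses `print()` on its own, with no table required. Storing the result of a display-only call gives `None`.

```python
# print displays its argument on the screen...
result = print("This text is displayed")

# ...but the value it hands back is None.
print(result is None)
```

This is why we use such calls for looking at data, not for assigning a new table to a variable.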

In [28]:
# Run this cell to load in the data
brfss = Table.read_table("data/brfss.csv")
brfss.show(5)

State,Day,Month,Year,Cell Phone,College Housing (Cell),College Housing (Landline),General Health,Physical Health,Mental Health,Personal Doctor,Health Plan,Sleep Time,Exercise,Diabetes,Diabetes Age,Binge Drinks,Sex,Children,100 Cigarettes,Days Smoking,Income Lower,Income Upper
Nebraska,2,2,2020,Yes,Missing,Missing,4,7,0,Yes,Yes,6,Yes,No,-1,0,Male,0,Yes,Never,10000,14999
West Virginia,1,6,2020,Missing,Missing,Missing,3,0,0,Yes,Yes,7,No,Yes,55,-1,Female,0,No,Missing,20000,24999
Georgia,0,6,2020,Missing,Missing,Missing,1,0,0,Yes,Yes,7,Yes,No,-1,-1,Female,1,No,Missing,50000,74999
Florida,2,8,2020,Missing,Missing,Missing,4,-1,0,Yes,Yes,10,No,Yes,80,-1,Male,0,Yes,Never,35000,49999
Nevada,1,6,2020,Yes,Missing,Missing,5,-1,7,Yes,No,8,Yes,No,-1,0,Male,0,No,Missing,-1,-1


**Question 2.1:** Assign `num_rows_brfss` and `num_columns_brfss` to the number of rows and columns in the original `brfss` table, respectively.


In [29]:
num_rows_brfss = brfss.num_rows
num_columns_brfss = brfss.num_columns
print(f"Our `brfss` table has {num_rows_brfss} rows and {num_columns_brfss} columns")

Our `brfss` table has 100000 rows and 23 columns


In [31]:
grader.check("q2_1")

### Investigating our Data

Now that we've loaded our data into the `brfss` table, let's take a closer look at its columns. Run the following cell to output the column names. 

In [32]:
brfss.labels

('State',
 'Day',
 'Month',
 'Year',
 'Cell Phone',
 'College Housing (Cell)',
 'College Housing (Landline)',
 'General Health',
 'Physical Health',
 'Mental Health',
 'Personal Doctor',
 'Health Plan',
 'Sleep Time',
 'Exercise',
 'Diabetes',
 'Diabetes Age',
 'Binge Drinks',
 'Sex',
 'Children',
 '100 Cigarettes',
 'Days Smoking',
 'Income Lower',
 'Income Upper')

Based on these column names, it looks like the data includes questions about **telecommunications**, **housing**, **demographic information**, **mental and physical health**, **alcohol and drug consumption**, and **physical exercise**. Each column in the `brfss` table corresponds to a question asked in the official BRFSS Survey.

## Data Manipulation

Now that we have a solid understanding of the basic table methods from the `datascience` library, let's put our skills to use! Even with a few tools, we are already able to arrive at powerful realizations about real world data.

**Question 2.2**: Assign `num_alabama_rows` to the number of times the name **Alabama** appeared in the `brfss` table.

*Hint*: Use the table methods you've learned above!


In [35]:
alabama_tbl = brfss.where("State", "Alabama")
num_alabama_rows = alabama_tbl.num_rows
num_alabama_rows

1352
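
The same "keep the matching rows, then count them" idea can be sketched with a plain Python list. The `states` list below is a made-up sample, not BRFSS data, and the list comprehension only stands in for what `where` does on a table.

```python
# Made-up sample of state names, one per survey response.
states = ["Alabama", "Georgia", "Alabama", "Nevada"]

# Keep only the entries equal to "Alabama", then count them.
alabama_only = [state for state in states if state == "Alabama"]
num_alabama = len(alabama_only)

print(num_alabama)
```

Filtering first and counting second mirrors the two lines in the cell above.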

In [None]:
grader.check("q2_2")

Take a closer look at some of the columns in the `brfss` table. For the next two questions, we will be looking at the `"Binge Drinks"` column, which corresponds to this survey question:
> Considering all types of alcoholic beverages, how many times during the past 30 days did you have 5 or more drinks for men or 4 or more drinks for women on an occasion?

Notice that this column contains negative values, most notably `-1`. Why might this be the case? Discuss with the people around you.

In [36]:
# Run this cell
brfss.column("Binge Drinks")

array([ 0., -1., -1., ...,  0.,  0.,  0.])

**Question 2.3**: Create a new table called `missing_binge_drinks` which only contains rows from the `brfss` table where there is a `-1` in the `"Binge Drinks"` column.


In [37]:
missing_binge_drinks = brfss.where("Binge Drinks", -1)
missing_binge_drinks

State,Day,Month,Year,Cell Phone,College Housing (Cell),College Housing (Landline),General Health,Physical Health,Mental Health,Personal Doctor,Health Plan,Sleep Time,Exercise,Diabetes,Diabetes Age,Binge Drinks,Sex,Children,100 Cigarettes,Days Smoking,Income Lower,Income Upper
West Virginia,1,6,2020,Missing,Missing,Missing,3,0,0,Yes,Yes,7,No,Yes,55,-1,Female,0,No,Missing,20000,24999
Georgia,0,6,2020,Missing,Missing,Missing,1,0,0,Yes,Yes,7,Yes,No,-1,-1,Female,1,No,Missing,50000,74999
Florida,2,8,2020,Missing,Missing,Missing,4,-1,0,Yes,Yes,10,No,Yes,80,-1,Male,0,Yes,Never,35000,49999
Maine,0,10,2020,Missing,Missing,Missing,2,0,0,Yes,Yes,6,Yes,Yes,55,-1,Male,0,Yes,Never,75000,-1
Arizona,0,4,2020,Yes,Missing,Missing,1,0,0,No,Yes,8,Yes,No,-1,-1,Male,2,Yes,Some Days,-1,-1
Texas,1,7,2020,Yes,Missing,Missing,4,0,0,No,Yes,8,No,No,-1,-1,Male,0,Yes,Every Day,15000,19999
South Carolina,0,11,2020,Missing,Missing,Missing,2,0,0,Yes,Yes,7,Yes,No,-1,-1,Female,0,No,Missing,75000,-1
Maine,2,2,2020,Missing,Missing,Missing,2,0,0,Yes,Yes,7,No,No,-1,-1,Male,0,Yes,Never,75000,-1
Ohio,1,9,2020,Yes,Missing,Missing,1,0,0,Yes,Yes,5,No,No,-1,-1,Male,2,Yes,Never,50000,74999
Massachusetts,1,2,2020,Yes,Missing,Missing,1,7,20,Yes,Yes,6,No,Yes,57,-1,Female,0,No,Missing,25000,34999


In [38]:
grader.check("q2_3")

<!-- BEGIN QUESTION -->

**Question 2.4 (*Discussion*):** Say we wanted to find the average of one of the columns from our original table. How does the inclusion of `-1` values *affect* this average? If we removed all the negative values, how would the average change?

Then, once you've answered, run the following cell to confirm your understanding.


The inclusion of -1 decreases the average. If we removed all the negative values, the average would increase. The average would likely be more representative once we remove all the negative values.

<!-- END QUESTION -->



In [44]:
brfss_no_negatives_children = brfss.where("Children", are.not_equal_to(-1))
print(f"With negatives: {np.average(brfss.column('Children'))}")
print(f"Without negatives: {np.average(brfss_no_negatives_children.column('Children'))}")

With negatives: 0.49044
Without negatives: 0.5016119932296285


**Question 2.6:** Using the `no_missing_income` table we've provided for you, determine the **average income** for respondents who:
1. Have a health insurance plan ("Yes" in "Health Plan" column)
2. Do not have a health insurance plan ("No" in "Health Plan" column)
3. Refused to answer this question ("Declined to Answer" in "Health Plan" column)

You may use the starter code provided for you, but are not required to.


In [50]:
no_missing_income = brfss.where("Income Upper", are.not_equal_to(-1))
health_plan = no_missing_income.where("Health Plan", "Yes")
no_health_plan = no_missing_income.where("Health Plan", "No")
declined = no_missing_income.where("Health Plan", "Declined to Answer")
average_plan = np.mean(health_plan.column("Income Upper"))
average_no_plan = np.mean(no_health_plan.column("Income Upper"))
average_declined = np.mean(declined.column("Income Upper"))

print(f"Respondents with a Health Insurance Plan made: \t\t${round(average_plan, 2)}")
print(f"Respondents without a Health Insurance Plan made: \t${round(average_no_plan, 2)}")
print(f"Respondents who refused to answer made: \t\t${round(average_declined, 2)}")

Respondents with a Health Insurance Plan made: 		$43832.35
Respondents without a Health Insurance Plan made: 	$34940.18
Respondents who refused to answer made: 		$30963.91


In [51]:
grader.check("q2_6")

With just over a week of data science under your belts, you're already able to uncover underlying **patterns and trends** within real world data. In this case, we've found that those with health insurance plans make, on average, almost 9,000 more dollars a year than those without health insurance plans.

What's valuable, too, is the information we gain when a population is *missing* or, in this case, *declines to participate*. Notice that the average income of respondents who declined to answer was around 4,000 dollars lower than that of respondents *without* a Health Insurance Plan.
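
The two gaps mentioned above can be checked with a little arithmetic. In this sketch the three averages are copied by hand from the printed output of Question 2.6, and the variable names are illustrative.

```python
# Averages copied from the printed output of Question 2.6.
plan_avg = 43832.35
no_plan_avg = 34940.18
declined_avg = 30963.91

gap_plan_vs_no_plan = round(plan_avg - no_plan_avg, 2)
gap_no_plan_vs_declined = round(no_plan_avg - declined_avg, 2)

print(gap_plan_vs_no_plan)       # difference between insured and uninsured
print(gap_no_plan_vs_declined)   # difference between uninsured and declined
```

The first gap is a little under 9,000 dollars and the second is a little under 4,000 dollars.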

## Done! 😇

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)