In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw04.ipynb")

<img style="display: block; margin-left: auto; margin-right: auto" src="./ccsf-logo.png" width="250rem;" alt="The CCSF black and white logo">

# Homework 4: Visualizations

## Recommended Reading 

[Chapter 7 - Visualizations](https://inferentialthinking.com/chapters/07/Visualization.html) 

## Assignment Reminders

- Make sure to run the code cell at the top of this notebook that starts with `# Initialize Otter` to load the auto-grader.
- For tasks marked with a 🔎 that ask for a written explanation, provide your answer in the designated space.
- Throughout this assignment and all future ones, please do not reassign variables! _For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you previously passed._
- We encourage you to discuss this assignment with others, but make sure to write and submit your own code. Refer to the syllabus to learn more about how to learn cooperatively.
- View the related <a href="https://ccsf.instructure.com" target="_blank">Canvas</a> Assignment page for additional details.
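
The reminder above about reassigning variables can be illustrated with a minimal sketch. The function `earlier_test` and the values below are hypothetical stand-ins, not part of the actual autograder.

```python
def earlier_test(value):
    """Hypothetical stand-in for an autograder test written for an earlier task."""
    return value == 98

# Answer to an earlier task (made-up value).
max_temperature = 98
passed_before = earlier_test(max_temperature)

# Reusing the same name while answering a later task changes what the earlier test sees.
max_temperature = 75
passed_after = earlier_test(max_temperature)

print(passed_before, passed_after)  # True False
```

Giving the later result its own name (for example, a new variable for the later task) avoids this problem.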

Run the following code cell to import the tools for this assignment.

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## Unemployment

_This section is continued from Homework 03._

The Federal Reserve Bank of St. Louis publishes data about jobs in the US through [FRED](https://fred.stlouisfed.org/categories/33509). There are many ways of defining unemployment, and our dataset includes two notions of the unemployment rate:

1. Among people who are able to work and are looking for a full-time job, the percentage who can't find a job.  This is called the Non-Employment Index, or `NEI`.
2. Among people who are able to work and are looking for a full-time job, the percentage who are only working at a part-time job.  The latter group is called "Part-Time for Economic Reasons", so the acronym for this index is `PTER`.

_Note: The `Year` column has fractional years because FRED produces quarterly reports._
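
As a rough sketch of how quarterly reports can lead to fractional years, the snippet below assumes each quarter is recorded as the year plus a quarter-sized offset. The helper `fractional_year` is hypothetical, and the exact encoding used in `unemployment.csv` may differ.

```python
def fractional_year(year, quarter):
    """Assumed encoding: quarter 1 -> year + 0.0, quarter 2 -> year + 0.25, and so on."""
    return year + (quarter - 1) / 4

quarters_1994 = [fractional_year(1994, q) for q in range(1, 5)]
print(quarters_1994)  # [1994.0, 1994.25, 1994.5, 1994.75]
```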

Run the following code cell to read the unemployment data as the table `unemployment`.

In [None]:
unemployment = Table.read_table('unemployment.csv')
unemployment

### Task 01 📍🔎

<!-- BEGIN QUESTION -->

Use the table `unemployment` to create a line plot of the PTER over time using one of the table methods you've learned from the course material. Your graph should show the `Year` values on the horizontal axis and the `PTER` values on the vertical axis.

_Points:_ 2

In [None]:
# Generate your chart in this cell
...

<!-- END QUESTION -->

### Task 02 📍🔎

<!-- BEGIN QUESTION -->

Again, using the table `unemployment`, create one visualization showing the trends of both `PTER` and `NEI` over time. Your graph should show the `Year` values on the horizontal axis.

_Points:_ 2

In [None]:
# Generate your chart in this cell
...

<!-- END QUESTION -->

### Task 03 📍🔎

<!-- BEGIN QUESTION -->

Write about any key features that are evident in the visualization from Task 02 and how they link to [The Great Recession](https://en.wikipedia.org/wiki/Great_Recession).

_Points:_ 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Birthrates

_This section is continued from Homework 03._

The following table gives census-based population estimates for each state on both July 1, 2015 and July 1, 2016. The last four columns describe the components of the estimated change in population during this time interval. **For all questions below, assume that the word "states" refers to all 52 rows including Puerto Rico & the District of Columbia.**

* The data was taken from [the US Census 2010-2016 national totals data set](http://www2.census.gov/programs-surveys/popest/datasets/2010-2016/national/totals/nst-est2016-alldata.csv).
* If you want to read more about the different column descriptions, click [here](http://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/totals/nst-est2015-alldata.pdf)!

The raw data is a bit messy. Run the cell below to clean the table and make it easier to work with.

In [None]:
# Don't change this cell; just run it.
pop = Table.read_table('nst-est2016-alldata.csv').where('SUMLEV', 40).select([1, 4, 12, 13, 27, 34, 62, 69])
pop = pop.relabeled('POPESTIMATE2015', '2015').relabeled('POPESTIMATE2016', '2016')
pop = pop.relabeled('BIRTHS2016', 'BIRTHS').relabeled('DEATHS2016', 'DEATHS')
pop = pop.relabeled('NETMIG2016', 'MIGRATION').relabeled('RESIDUAL2016', 'OTHER')
pop = pop.with_columns("REGION", np.array([int(region) if region != "X" else 0 for region in pop.column("REGION")]))
pop.set_format([2, 3, 4, 5, 6, 7], NumberFormatter(decimals=0)).show(5)

### Task 04 📍

In the next task, you will be creating a visualization to understand the relationship between birth and death rates for the states. The annual death rate for a year-long period in a state is the total number of deaths in that period for that state as a proportion of the population size at the start of the time period for that state. The annual birth rate is defined the same way, using births in place of deaths.
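
The rate definition above amounts to a single division. Here is a small worked example with made-up numbers for one hypothetical state; the values are not taken from the `pop` table.

```python
# Hypothetical values for a single state, for illustration only.
population_at_start = 1_000_000   # population at the start of the period
deaths_in_period = 8_000          # deaths during the year-long period

# Deaths as a proportion of the starting population.
death_rate = deaths_in_period / population_at_start
print(death_rate)  # 0.008
```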

What visualization is most appropriate to see if there is an association between birth and death rates during a given time interval among the states?

1. Line Graph
2. Scatter Plot
3. Bar Chart

Assign `visualization` below to the number corresponding to the correct visualization.


_Points:_ 2

In [None]:
visualization = ...

In [None]:
grader.check("task_04")

### Task 05 📍🔎

<!-- BEGIN QUESTION -->

In the code cell below, create a visualization that will help us determine if there is an association between birth rate and death rate during this time interval among the states. It may be helpful to create an intermediate table here.


_Points:_ 2

In [None]:
# Generate your chart in this cell
...

<!-- END QUESTION -->

### Task 06 📍

Assign `assoc` to the number of the statement that best describes the association between birth rate and death rate during this time interval among the states.

1. There is a positive association between birth rate and death rate.
2. There is a negative association between birth rate and death rate.
3. There is no association between birth rate and death rate.


_Points:_ 1

In [None]:
assoc = ...

In [None]:
grader.check("task_06")

### Task 07 📍🔎

<!-- BEGIN QUESTION -->

Explain the visual evidence that supports your response to Task 06.

_Points:_ 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Google Play Store Apps

_The following tasks are modified from UC San Diego's [DSC 10](https://dsc10.com/) coursework._

In the next few tasks, you will explore the [Google Play Store Apps Dataset](https://www.kaggle.com/lava18/google-play-store-apps) that was [scraped](https://en.wikipedia.org/wiki/Web_scraping) from the Google Play Store.

Run the following cell to load the data set.

In [None]:
google = Table.read_table('googleplaystore.csv')
google

Each row in the table corresponds to an app. Here are descriptions of some of the columns. 
- `'Category'`: Category the app belongs to.
- `'Rating'`: Overall user rating of the app out of 5 (at the time of data retrieval).
- `'Reviews'`: Number of user reviews for the app (at the time of data retrieval).
- `'Installs'`: Number of user downloads/installs for the app (at the time of data retrieval).
- `'Content Rating'`: Intended audience of the app, such as "Everyone" or "Teen".

**Note**: `'Rating'` and `'Content Rating'` mean different things. Don't get them mixed up!

You can see that `google` has 10,825 rows. This means that the dataset lists 10,825 app entries, but the app names are not unique! Some apps appear more than once.

Run the following cell to calculate how many unique app names there are:

In [None]:
num_rows = google.num_rows
num_apps = google.group('App').num_rows
print(f"There are {num_rows:,} rows in the table, but there are only {num_apps:,} apps!")

### Task 08 📍🔎

<!-- BEGIN QUESTION -->

Create a visualization showing the distribution of app ratings.

_Points:_ 2

In [None]:
# Generate your chart in this cell
...

<!-- END QUESTION -->

### Task 09 📍

There certainly seem to be a lot of excellent apps out there! It would be interesting to see whether the apps with higher ratings also have more reviews. 

What type of plot would you want to create to help you determine whether higher-rated apps also have more reviews? Assign either 1, 2, 3, or 4 to the name `plot_type` below.

1. scatter plot
2. line graph
3. bar graph
4. histogram

_Points:_ 2

In [None]:
plot_type = ...

In [None]:
grader.check("task_09")

### Task 10 📍🔎

<!-- BEGIN QUESTION -->

Now create the plot you identified above to help you determine whether higher-rated apps also have more reviews.

_Points:_ 2

In [None]:
# Generate your chart in this cell
...

<!-- END QUESTION -->

### Task 11 📍

Which of the following can we conclude, based on this data? Assign either 1, 2, 3, or 4 to the name `plot_conclusion` below.

1. Apps with higher ratings will become more popular, and since more people are using these apps, more reviews are given.
2. Apps with more reviews will become more popular, and since more people are using these apps, higher ratings are given.
3. Both 1 and 2.
4. Neither 1 nor 2.

_Points:_ 2

In [None]:
plot_conclusion = ...

In [None]:
grader.check("task_11")

### Task 12 📍🔎

<!-- BEGIN QUESTION -->

Create a visualization that shows the distribution of app content ratings. 

_Make sure that the most common content ratings are shown at the top of the visualization._

_Points:_ 2

In [None]:
# Generate your chart in this cell
...

<!-- END QUESTION -->

## Marginal Histograms


Consider the following scatter plot: 

![](scatter.png)

The axes of the plot represent values of two variables: $x$ and $y$. 

Suppose we have a table called `t` that has two columns in it:

- `x`: a column containing the x-values of the points in the scatter plot
- `y`: a column containing the y-values of the points in the scatter plot
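
A histogram of a single column ignores the other column entirely: it only counts how many values fall into each bin. The sketch below uses a handful of made-up points (not the ones in the scatter plot above) and a hypothetical helper `bin_counts` to show how the same points can give very different counts for `x` and for `y`.

```python
from collections import Counter

# Hypothetical (x, y) points, for illustration only.
points = [(0.2, 1.1), (0.7, 1.4), (1.3, 1.6), (1.8, 1.2), (2.5, 1.5), (2.9, 1.9)]

def bin_counts(values, width=1.0):
    """Count how many values fall in each bin [k * width, (k + 1) * width)."""
    return Counter(int(v // width) for v in values)

x_counts = bin_counts(x for x, _ in points)
y_counts = bin_counts(y for _, y in points)

print(dict(sorted(x_counts.items())))  # {0: 2, 1: 2, 2: 2}
print(dict(sorted(y_counts.items())))  # {1: 6}
```

In this toy example the x-values are spread evenly over three bins, while every y-value lands in the same bin, so the two histograms would look quite different.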

Below, you are given two histograms, each of which corresponds to either column `x` or column `y`. 

### Histogram A

![](var1.png)

### Histogram B

![](var2.png)

### Task 13 📍

Suppose we run `t.hist('x')`. Which histogram does this code produce? Assign `histogram_column_x` to either 1 or 2.

1. Histogram A
2. Histogram B


_Points:_ 1

In [None]:
histogram_column_x = ...

In [None]:
grader.check("task_13")

### Task 14 📍🔎

<!-- BEGIN QUESTION -->

State at least one reason why you chose the histogram from Task 13. Make sure to indicate which histogram you selected (ex: "I chose histogram A because ...").


_Points:_ 1

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Task 15 📍

Suppose we run `t.hist('y')`. Which histogram does this code produce? Assign `histogram_column_y` to either 1 or 2.

1. Histogram A
2. Histogram B


_Points:_ 1

In [None]:
histogram_column_y = ...

In [None]:
grader.check("task_15")

### Task 16 📍🔎

<!-- BEGIN QUESTION -->

State at least one reason why you chose the histogram from Task 15.  Make sure to indicate which histogram you selected (ex: "I chose histogram A because ...").

_Points:_ 1

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Submit your Homework to Canvas

Once you have finished working on the homework tasks, prepare to submit your work in Canvas by completing the following steps.

1. In the related Canvas Assignment page, check the rubric to know how you will be scored for this assignment.
2. Double-check that you have run the code cell near the end of the notebook that contains the command `grader.check_all()`. This command will run all of the tests on your responses to the auto-graded tasks marked with 📍.
3. Double-check your responses to the manually graded tasks marked with 📍🔎.
4. Select the menu items "File", "Save Notebook" in the notebook's Toolbar to save your work and create a checkpoint in the notebook's work history.
5. Select the menu items "File", "Download" in the notebook's Toolbar to download the notebook (.ipynb) file.
6. In the related Canvas Assignment page, click Start Assignment or New Attempt to upload the downloaded .ipynb file.

**Keep in mind that the autograder does not always check for correctness. Sometimes it just checks for the format of your answer, so passing the autograder for a question does not mean you got the answer correct for that question.**
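
To see why a passing test does not guarantee a correct answer, consider the following sketch of a format-only check. The function `format_only_check` is hypothetical and is not the actual test used by the autograder; it only assumes a multiple-choice task with options 1 through 3.

```python
def format_only_check(answer):
    """Hypothetical test: passes for any valid option number, right or wrong."""
    return answer in (1, 2, 3)

# Every listed option passes this kind of check, even though at most one is correct.
print(format_only_check(1), format_only_check(3))  # True True

# Only an answer in the wrong format fails.
print(format_only_check("scatter plot"))  # False
```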

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()