# Homework 3 — Subsetting and Plotting Data: Understanding Time Use
 


# Introduction

For this week's homework, we are going to continue to work with the Statistics Canada GSS Time Use Dataset. This time we're going to dig into some of the well-being variables (feeling rushed) and respondent characteristic variables (how people commute to work).


# Question

The question you're answering in this homework:

> Do urban resident respondents who drive a car to commute (either some or all of the time) feel less rushed than those respondents who don't drive a car for commuting?

# Lab Instructions and Learning Objectives

Just like in the previous homework, you will be creating and submitting a data story answering a data science question. You will be required to submit your work in the same format as last time, complete with sections for *Introduction*, *Data*, *Methods*, *Computation*, and *Conclusion*.

In this lab, you will:
* Create a data story in a notebook exploring the question.
* Work with the Time Use dataset from lecture to investigate if commuting by car affects feelings of being rushed.
* Write and use Boolean expressions to focus on specific observations in our dataset (that is, subset `DataFrame`s using `.loc` and `.iloc`).
* Create and name new columns, and use Boolean expressions to assign new values based on values in existing columns.
* Produce and interpret a crosstabulation table to compare how respondents are distributed across two Boolean variables.
* Plot the results of your crosstabulation, and use the visualization to describe general trends.
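
As a quick refresher on the two indexers mentioned above, here is a minimal sketch on toy data. The table, its index labels, and its column names are all made up for illustration and are not part of the Time Use dataset.

```python
import pandas as pd

# Toy data; the index labels and column names are invented for illustration.
toy = pd.DataFrame(
    {"minutes": [30, 45, 10, 60], "mode": ["car", "bus", "walk", "car"]},
    index=[101, 102, 103, 104],
)

# .loc selects by label: the row whose index label is 102.
by_label = toy.loc[102, "minutes"]

# .iloc selects by integer position: the second row, first column.
by_position = toy.iloc[1, 0]

# .loc also accepts a Boolean Series and keeps the rows where it is True.
is_car = toy["mode"] == "car"
car_rows = toy.loc[is_car]

print(by_label, by_position)
print(car_rows)
```

Here both lookups reach the same cell, because label `102` happens to sit in position 1. In this homework you will mostly use the Boolean form of `.loc`.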


# Due date 

You will submit your completed Homework 3 on MarkUs by *Mon, Jan 31 2021 at 11:59 PM EST*. We will send an announcement in a couple of days when autotesting has been set up on MarkUs.

# GGR: How to submit

1. Download your homework to your local computer and save it as `GGR274_Homework_3.ipynb`.
2. Log in here: https://markus-ds.teach.cs.toronto.edu.
3. Submit your homework to `HW3: Homework 3`.

# Marking Rubric


Section | 0 marks | 1 mark | 2 marks | 3 marks
--------|---------|--------|---------|--------
Introduction | The question is not stated correctly or is left blank | The question is stated correctly | NA | NA
Data (for each Python variable) | Autotest fails | Autotest passes | NA | NA
Methods (for each part) | No answer | The data extracted is specified or a reasonable rationale is given, but not both | Both the data extracted is specified and a reasonable rationale is given | NA
Computation | Autotest fails | Autotest passes | NA | NA
Conclusion (for each part) | No answer | The question is answered but no explanation is given | The question is answered but the explanation is not supported or weakly supported by the data | The question is answered and the explanation is supported by the data

Maximum grade: **35**


# Introduction section

This should introduce the question being explored in a sentence. __(1 mark)__

# Data section

The `Data` part of your notebook should read the raw data, extract a `DataFrame` containing the important columns, and present the overall data. Create at least these four variables. (You might find it helpful to create other variables to name intermediate values.)

+ `time_use_data_raw`: the `DataFrame` created by reading the `gss_tu2016_main_file.csv` file. __(1 mark)__
+ `time_use_data`: the `DataFrame` containing only the relevant columns from the raw data: the `'CASEID'`, `'luc_rst'`, `'gtu_110'`, and `'ctw_140a'` columns. __(1 mark)__
+ `new_column_names`: the dictionary mapping the column names from `time_use_data` to the values `'case_ID'`, `'urban_rural'`, `'feeling_rushed'`, and `'commute_driver'`, respectively. __(1 mark)__
+ `clean_time_use_data`: the `DataFrame` that is the result of renaming the columns in `time_use_data`, using the dictionary `new_column_names`. (We will not autotest this `DataFrame` here. The autotester examines `urban_subset_time_use`, which is derived from it, after you have added the new columns described below.)


We will check the value of these variables in the autotester. You'll probably want to use a few other variables along the way for the intermediate steps, like naming a list of important columns, but we're not autotesting those.

Make sure to select the columns in the order specified above.
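
The read, select, and rename steps described above follow a common pattern, sketched here on a small stand-in table. Everything in this example (the CSV text, the column names, and the variable names) is invented for illustration, so you will need to adapt it to the real file and the names required above.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the column names and values are invented.
csv_text = io.StringIO(
    "ID,region,hours,extra\n"
    "1,1,7.5,a\n"
    "2,2,6.0,b\n"
    "3,1,8.0,c\n"
)
raw = pd.read_csv(csv_text)

# Keep only the columns of interest, in a chosen order.
important_columns = ["ID", "region", "hours"]
selected = raw[important_columns]

# Map old names to new names, then rename; rename returns a new DataFrame
# and leaves `selected` unchanged.
name_map = {"ID": "person_id", "region": "region_code", "hours": "sleep_hours"}
renamed = selected.rename(columns=name_map)

print(renamed.columns.tolist())
```

With a real file, you would pass the file name to `pd.read_csv` instead of the `io.StringIO` object.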

Here is some code for you to use to check your variable values. Copy and paste these cells into your notebook at the end of the appropriate section. For example, these print statements should go in a cell at the end of the Data section:

```python
# Data check
print("time_use_data_raw:")
print(time_use_data_raw)
print("time_use_data:")
print(time_use_data)
print("new_column_names:")
print(new_column_names)
print("clean_time_use_data:")
print(clean_time_use_data)
```

# Methods section

Start with a Markdown cell describing what you're going to do, which is:

1. Filter the data (make a new `DataFrame` containing a subset of the rows) to include only urban respondents. What variable in our dataset contains this information? Why are we interested in this subset? Explain in a few sentences. __(2 marks)__
2. Create a new column that codes whether or not someone feels rushed at least once a week. What values does this new column rely upon, what does each value represent, and what is its Python type? __(2 marks)__
3. Create a new column that codes whether or not a respondent reports driving a car to work. What values does this new column take, what does each value represent, and what is its Python type? __(2 marks)__
4. Compare how many respondents feel rushed at least once a week, for the group of respondents who do not commute by car vs. those who do. What data are we using, and why? Explain in a few sentences. __(2 marks)__
5. Use a visualization to describe your results from step 4. What visualization is most appropriate, and why? How would this visualization be interpreted? Explain in a few sentences. __(2 marks)__

# Computation section

There are a few sections to this, as outlined in the Methods. First, you will subset your dataset for urban respondents. Then you'll add a column that indicates whether a respondent feels rushed at least once a week, and another column that indicates whether a respondent commutes via car. Finally, we will analyze whether respondents who commute via car feel rushed more often than those who do not commute via car.

## Subset Data

First, let's subset our data to include only respondents who live in Urban areas. 

Create these variables along the way. We will check them in the autotester. We will not check your intermediate steps.

+ `urban_respondents_only`: a Boolean `Series` that is `True` when a respondent lives in an Urban area, and `False` otherwise. __(1 mark)__ 
    * Hint: you might want to refer to Lecture 3 or the dataset codebook to determine which values correspond to a respondent in an Urban area.
+ `urban_subset_time_use`: a `DataFrame` that contains only Urban respondents. __(1 mark)__
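
The two variables above follow the usual mask-then-filter pattern, sketched here on toy data. The column `area_code` and its codes are invented for this example. Check the codebook for the real values.

```python
import pandas as pd

# Toy data: the codes in "area_code" are invented (say 1 = city, 2 = countryside).
people = pd.DataFrame({"person_id": [1, 2, 3, 4], "area_code": [1, 2, 1, 2]})

# Comparing a column to a value gives a Boolean Series, one entry per row.
lives_in_city = people["area_code"] == 1

# .loc keeps the rows where the mask is True. Calling .copy() makes the
# result an independent DataFrame, which avoids the SettingWithCopyWarning
# that some pandas versions show when you later add columns to a subset.
city_people = people.loc[lives_in_city].copy()

print(lives_in_city.tolist())
print(city_people)
```

Note that the subset keeps the original index labels of the rows it selected.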



```python
# Subset Data check
print("urban_respondents_only:")
print(urban_respondents_only)
print("urban_subset_time_use:")
print(urban_subset_time_use)
```

## Create new columns

Let's add two new columns to our dataset of urban respondents. These columns will contain Boolean values.

Create the following variables along the way. We will check them in the autotester. We will not check your intermediate steps.

### Feeling rushed

+ `feels_rushed_true`: a Boolean `Series` that is `True` when a respondent feels rushed at least once a week, and `False` otherwise.  __(1 mark)__
+ `feels_rushed_false`: a Boolean `Series` that is `True` when a respondent feels rushed less often than once a week, and `False` otherwise. __(1 mark)__

Use those two variables to create a new column called `'feels_rushed_YN'` in `urban_subset_time_use` that is `True` when a respondent feels rushed at least once a week and `False` otherwise.
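
One way to build a Boolean column from two masks is to assign through `.loc`, as in this minimal sketch. The column `stress_code`, its codes, and the cut-off used here are invented and do not reflect the codebook.

```python
import pandas as pd

# Toy data: "stress_code" uses invented codes (1-2 = often, 3-4 = rarely).
survey = pd.DataFrame({"person_id": [1, 2, 3, 4], "stress_code": [1, 4, 2, 3]})

# Two masks that, between them, cover every row in this toy table.
often = survey["stress_code"] <= 2
rarely = survey["stress_code"] >= 3

# Assign True to the rows picked by one mask and False to the rows picked
# by the other; the new column is created by the first assignment.
survey.loc[often, "stressed_YN"] = True
survey.loc[rarely, "stressed_YN"] = False

print(survey)
```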

```python
# Feeling rushed check
print("feels_rushed_true:")
print(feels_rushed_true)
print("feels_rushed_false:")
print(feels_rushed_false)
urban_subset_time_use
```

### Commute by car

+ `commute_car_true`: a Boolean `Series` that is `True` when a respondent has indicated that they commute via car, and `False` otherwise.  __(1 mark)__
+ `commute_car_false`: a Boolean `Series` that is `True` when a respondent has indicated that they do NOT commute via car, and `False` otherwise. __(1 mark)__

Use those two variables to create a new column in `urban_subset_time_use` called `'commute_car_YN'` that is `True` when a respondent has commuted via car and `False` otherwise.

In the autotester, we will examine `urban_subset_time_use` after both columns have been added. __(1 mark)__
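
When two masks do not cover every row, the rows matched by neither are left without a value. The sketch below illustrates this on toy data. The column `car_code` and its codes, including the hypothetical "not stated" code, are invented for this example.

```python
import pandas as pd

# Toy data: "car_code" uses invented codes (1 = yes, 2 = no, 9 = not stated).
trips = pd.DataFrame({"person_id": [1, 2, 3, 4], "car_code": [1, 2, 9, 1]})

said_yes = trips["car_code"] == 1
said_no = trips["car_code"] == 2

trips.loc[said_yes, "car_YN"] = True
trips.loc[said_no, "car_YN"] = False

# Rows matched by neither mask hold a missing value (NaN) in the new column.
unanswered = trips["car_YN"].isna()

print(trips)
print("Rows with no answer:", unanswered.sum())
```

Counting the missing values this way is a quick check on whether your masks cover the rows you expect.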

```python
# Commute by car check
print("commute_car_true:")
print(commute_car_true)
print("commute_car_false:")
print(commute_car_false)
urban_subset_time_use
```

## Create a crosstabulation

Let's create a crosstabulation to compare respondents between our two new columns, `'feels_rushed_YN'` and `'commute_car_YN'`.

Create the following variables along the way. We will check them in the autotester. We will not check your intermediate steps.

+ `columns_to_crosstab`: a `DataFrame` that contains only the columns `'feels_rushed_YN'` and `'commute_car_YN'` extracted from `urban_subset_time_use`. __(1 mark)__
+ `feels_rushed_commute_car_crosstab`: a crosstabulation, using the columns in `columns_to_crosstab`. __(1 mark)__
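
As a reminder of what a crosstabulation of two Boolean columns looks like, here is a minimal sketch using `pd.crosstab` on toy data. The column names and values are invented for illustration.

```python
import pandas as pd

# Toy data with two invented Boolean columns.
answers = pd.DataFrame(
    {
        "owns_bike": [True, True, False, False, True, False],
        "likes_rain": [False, True, False, False, True, True],
    }
)

# Rows of the result are the values of the first argument, columns are the
# values of the second; each cell counts the rows with that combination.
bike_rain_counts = pd.crosstab(answers["owns_bike"], answers["likes_rain"])

print(bike_rain_counts)
```

The cells always add up to the number of rows that have a value in both columns, which is six in this toy table.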


```python
# Crosstabulation check
print("columns_to_crosstab: ")
print(columns_to_crosstab)
print("feels_rushed_commute_car_crosstab: ")
print(feels_rushed_commute_car_crosstab)
```

## Plot your results

Finally, we can visually analyze the results of the crosstabulation. 

Create a bar plot of the `feels_rushed_commute_car_crosstab` crosstabulation using `.plot.bar()` and name it `crosstab_barplot`. __(1 mark)__ 
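
Here is a minimal sketch of plotting a crosstabulation as a bar plot, using invented toy data. The `matplotlib.use("Agg")` line is only there so the example also runs outside a notebook. The variable and column names are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # Draw off-screen; you can skip this line in a notebook.

import pandas as pd

# Toy data with two invented Boolean columns.
pets = pd.DataFrame(
    {
        "has_dog": [True, True, False, False, True, False],
        "has_cat": [False, True, False, False, True, True],
    }
)
pet_counts = pd.crosstab(pets["has_dog"], pets["has_cat"])

# .plot.bar() draws one cluster of bars per row of the table and one bar
# per column within each cluster. It returns the matplotlib Axes object.
pet_barplot = pet_counts.plot.bar()
pet_barplot.set_ylabel("Number of households")

print(pet_barplot)
```

Because the returned object is a matplotlib `Axes`, you can keep adjusting labels and titles through it after the plot has been created.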



```python
print("crosstab_barplot: ")
print(crosstab_barplot)
```

# Conclusion

Include cells with your answers to each of these questions:
 
1. Do respondents who commute via car feel rushed more often than those who don't commute by car? Briefly explain. __(3 marks)__
2. What do the values in the crosstabulation represent? Use these values and your visualization to draw at least one conclusion about the relationship between commuting and feeling rushed.  Briefly explain how you arrived at your conclusions. __(3 marks)__
3. Think about what aspects of commuting can lead a person to 'feel rushed'. Propose two or three specific potential causes for car and/or non-car commuters. __(3 marks)__
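
When the two groups you are comparing differ in size, raw counts can be misleading, so it may help to also look at the share within each group. The sketch below shows one way to do this. It is only a suggestion, and the counts and labels are invented.

```python
import pandas as pd

# Invented counts: each row is a group, each column an answer.
counts = pd.DataFrame(
    {"no": [130, 10], "yes": [70, 10]},
    index=pd.Index(["group A", "group B"], name="group"),
)

# Divide each row by its row total so that every row sums to 1.
row_shares = counts.div(counts.sum(axis=1), axis=0)

print(row_shares)
```

In this toy table, group A has more "yes" answers in absolute terms (70 vs. 10), but a smaller share of group A answered "yes" (0.35 vs. 0.5). `pd.crosstab` can also compute row shares directly through its `normalize="index"` argument.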

# BEFORE YOU SUBMIT: rerun your whole notebook!

Before you submit, re-run all the Code cells in your notebook from top to bottom and read the output carefully to make sure there are no unexpected errors.