# Homework 5 - Summary statistics, Histograms and Boxplots


# Introduction

For this week's homework, we're going to continue to work with the Statistics Canada GSS Time Use Dataset. We'll aim to understand how respondents' sleep duration per night varies within the Canadian population. We'll also focus our analysis on contrasting how sleep durations differ between the subgroup of respondents who feel that they tend to reduce their sleep, and those who do not.

# Question

**_Is there a difference in the amount of time slept between those who feel that they purposely reduce sleep and those who do not?_**


# Instructions and Learning Objectives

Just like in the previous homework, you will be creating and submitting a data story answering a data science question. You will be required to submit your work in the same format as last time, complete with sections for *Introduction*, *Data*, *Methods*, *Computation*, and *Conclusion*.

In this homework, you will:
* Create a data story in a notebook exploring the question.
* Work with the Time Use dataset from lecture to investigate whether the mean and median time spent sleeping differ between those who feel that they purposely sleep less and those who do not
* Identify and remove any missing values from our columns of interest
* Visualize and interpret histograms and boxplots that describe the distribution and variation in time slept between groups of respondents

# Due date 

You will submit your completed Homework 5 on MarkUs by *Fri, Feb 18 2022 at 11:59 PM EST*. We will send an announcement in a couple of days when autotesting has been set up.

# GGR: How to submit

1. Download your homework to your local computer and save it as `GGR274_Homework_5.ipynb`.
2. Log in here: https://markus-ds.teach.cs.toronto.edu.
3. Submit your homework to `hw5: Homework 5`.

# Marking Rubric


Section     | 0 | 1 | 2 | 3
------------|---|---|---|---
Introduction|The question is not stated correctly or left blank | The question is stated correctly | NA | NA 
Data (for each Python variable) | Autotest fails | Autotest passes | NA | NA
Methods (for each part) | No answer | The data extracted is specified or a reasonable rationale is given, but not both | Both the data extracted is specified and a reasonable rationale is given | NA
Computation | Autotest fails | Autotest passes | NA | NA
Conclusion (for each part) | No answer | The question is answered but no explanation is given | The question is answered but the explanation is not supported or weakly supported by the data | The question is answered and the explanation is supported by the data 

# Introduction section

This should introduce the question being explored in a sentence. __(1 mark)__

# Data section

The Data part of your notebook should read the raw data, extract a `DataFrame` containing the important columns, rename the columns, and filter out missing values.

You might find it helpful to name intermediate values in your algorithms. That way you can examine them to make sure they have the type you expect and that they look like what you expect. Very helpful when debugging!

## Step 1

Create the following pandas `DataFrame`s:

+ `time_use_data_raw`: the `DataFrame` created by reading the `gss_tu2016_main_file.csv` file. __(1 mark)__


+ `time_use_data`: the `DataFrame` containing the following columns from `time_use_data_raw`: `'CASEID'`, `'dur01'`, `'tcs_130'`. __(1 mark)__ (We test this after any changes are made to it. We do not check the initial value.)

In a markdown cell, after you read in the data and select the relevant columns, describe what each of the selected columns represents. Refer to the codebook to help with your description __(1 mark)__
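
The read-then-select pattern described above can be sketched as follows. This is only an illustration on a tiny in-memory CSV: the data values, the extra `AGE` column, and the names `toy_csv`, `toy_raw`, and `toy_subset` are made up for the example and are not part of the assignment.

```python
import io

import pandas as pd

# Toy stand-in for a CSV file; the values and the AGE column are invented.
toy_csv = io.StringIO(
    "CASEID,dur01,tcs_130,AGE\n"
    "10000,510,2,34\n"
    "10001,420,2,51\n"
    "10002,570,1,27\n"
)

# Read the whole file, then keep only the columns of interest.
toy_raw = pd.read_csv(toy_csv)
toy_subset = toy_raw[["CASEID", "dur01", "tcs_130"]]

print(toy_subset.shape)  # (3, 3): three rows, three selected columns
```

Passing a list of column names inside the square brackets returns a `DataFrame`; passing a single string would return a `Series` instead.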

In [20]:
# Step 1 check your work

assert time_use_data.shape == (17390, 3)

## Step 2

`time_use_data` could use more informative column names. 

Create the following:

+ `time_use_data_new_column_names`: a Python dictionary mapping the column names from `time_use_data` to the values `'participant_ID'`, `'time_spent_sleeping_minutes'`, `'reduce_time_sleeping'`. __(1 mark)__

+ `time_use_data_clean`: a new `DataFrame` that is a copy of `time_use_data`, but with the columns renamed using `time_use_data_new_column_names`. __(1 mark)__
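
As a minimal sketch of renaming with a dictionary, here is the idea on a two-column toy `DataFrame`. The names `toy`, `new_names`, and `toy_renamed` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({"CASEID": [10000, 10001], "dur01": [510, 420]})

# Keys are the existing column names, values are the new names.
new_names = {"CASEID": "participant_ID", "dur01": "time_spent_sleeping_minutes"}

# rename returns a new DataFrame; the original is left unchanged.
toy_renamed = toy.rename(columns=new_names)

print(list(toy_renamed))
print(list(toy))
```

Because `rename` returns a copy by default, the result has to be assigned to a variable to be kept.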

In [23]:
# Step 2 check that you have the correct column names

expected_columnnames = ['participant_ID', 'time_spent_sleeping_minutes', 'reduce_time_sleeping']

assert expected_columnnames == list(time_use_data_clean)

## Step 3

We want to remove values from the columns `'time_spent_sleeping_minutes'` and `'reduce_time_sleeping'` that are **missing-data codes**. Refer to the codebook (`gss_tu2016_codebook.txt`) to look up which values of the columns in `time_use_data_clean` are **missing-data codes**.

Create a `DataFrame` named `clean_time_use_data` that contains the rows of `time_use_data_clean`, excluding every row in which either column holds a **missing-data code**. __(1 mark)__

We will check `clean_time_use_data` in the autotester. You'll probably want to use a few other variables along the way for the intermediate steps, like naming a list of important columns, but we're not autotesting those. In future homework, we'll ask for even fewer intermediate steps.
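
One common way to drop rows that hold special codes is a boolean mask built with `isin`. The sketch below uses a toy `DataFrame` and **hypothetical** codes (`96` to `99`) chosen only for illustration; the real missing-data codes must come from the codebook.

```python
import pandas as pd

# HYPOTHETICAL codes for illustration only; look up the real ones in the codebook.
placeholder_codes = [96, 97, 98, 99]

toy = pd.DataFrame(
    {
        "time_spent_sleeping_minutes": [510, 420, 570, 480],
        "reduce_time_sleeping": [2, 97, 1, 99],
    }
)

# True for rows whose answer is one of the placeholder codes.
is_missing = toy["reduce_time_sleeping"].isin(placeholder_codes)

# ~ flips True and False, so only the unflagged rows are kept.
toy_valid = toy[~is_missing]

print(len(toy), len(toy_valid))  # 4 2
```

Naming the mask (`is_missing`) as an intermediate value makes it easy to inspect how many rows are about to be removed.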

In [35]:
# Step 3 check that clean_time_use_data has expected number of rows

expect_num_rows = 17032

assert expect_num_rows == len(clean_time_use_data)

# Methods section

Start with a Markdown cell describing what you're going to do, which is:

1. Convert the units of time spent sleeping from minutes to hours. Briefly explain why we might want to do this conversion. __(2 marks)__

2. Visualize the distributions of time spent sleeping (hours) using a histogram. What are the main features of the histogram that you would describe to a non-technical audience? __(2 marks)__

3. Describe the distribution, including the min, max, mean, 25th and 75th percentile of hours spent sleeping (in a day) by all respondents. What would we expect a valid range of these values to be, that is, what would a reasonable min and max value be? Briefly explain. __(2 marks)__

4. Graphically compare the distributions, using side-by-side boxplots, of time spent sleeping (hours) between respondents who feel that they tend to reduce sleep and those who do not. Briefly explain why side-by-side boxplots are a reasonable choice for comparing the distributions. __(2 marks)__
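
For item 3, the statistics listed can be obtained in one call, and a range check makes the "valid range" idea concrete. This is a sketch on invented values; `toy_hours` and `all_in_range` are illustrative names, and the bounds 0 and 24 assume the measurement covers a single day.

```python
import pandas as pd

toy_hours = pd.Series([0.0, 7.0, 8.5, 9.5, 24.0], name="time_spent_sleeping_hours")

# count, mean, std, min, quartiles and max in one call.
toy_description = toy_hours.describe()
print(toy_description)

# Assuming a single day, every value should fall between 0 and 24 hours (inclusive).
all_in_range = toy_hours.between(0, 24).all()
print(all_in_range)  # True
```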

# Computation section

## Convert minutes to hours

Convert the units of the column `'time_spent_sleeping_minutes'` in `clean_time_use_data` from minutes to hours. 

Create a new column in `clean_time_use_data` named `'time_spent_sleeping_hours'` that is calculated by converting `'time_spent_sleeping_minutes'` from minutes to hours.
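
Arithmetic on a column is applied to every row at once, so a unit conversion is a single expression. A minimal sketch on a toy `DataFrame` (the name `toy` and its values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"time_spent_sleeping_minutes": [510, 420, 525]})

# Dividing a column by a number divides every row's value.
toy["time_spent_sleeping_hours"] = toy["time_spent_sleeping_minutes"] / 60

print(toy["time_spent_sleeping_hours"].tolist())  # [8.5, 7.0, 8.75]
```

If the `DataFrame` was produced by filtering another one, some pandas versions may emit a `SettingWithCopyWarning` when a column is added; calling `.copy()` on the filtered result first is one way to avoid it.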


In [27]:
# check that the new column is as expected

clean_time_use_data.head()

| | participant_ID | time_spent_sleeping_minutes | reduce_time_sleeping | time_spent_sleeping_hours |
|---|---|---|---|---|
| 0 | 10000 | 510 | 2 | 8.5 |
| 1 | 10001 | 420 | 2 | 7.0 |
| 2 | 10002 | 570 | 1 | 9.5 |
| 3 | 10003 | 510 | 2 | 8.5 |
| 4 | 10004 | 525 | 1 | 8.75 |


## Quantitative summary of distributions

Part of our analysis and data story involves **Exploratory Data Analysis** (EDA), a process in which we iterate between visualizations and numerical analyses to better understand data. Numerical summaries are a key part of EDA.

We're interested in numerical summary statistics of how many hours respondents sleep.

Use the `groupby` method to provide a quantitative summary of the distributions of **only** `'time_spent_sleeping_hours'` by `'reduce_time_sleeping'` using the `describe` method. `groupby` and `describe` are both provided by the `pandas` library.

Name the output `sleep_summary_byreduce`. __(1 mark)__
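
The group-then-describe pattern can be sketched on a toy `DataFrame` as follows. The values and the names `toy` and `toy_summary` are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "reduce_time_sleeping": [1, 1, 2, 2],
        "time_spent_sleeping_hours": [7.0, 8.0, 8.5, 9.5],
    }
)

# Selecting with a list (double brackets) keeps a DataFrame, so the result
# has two header levels: the column name, then the statistic.
toy_summary = toy.groupby("reduce_time_sleeping")[["time_spent_sleeping_hours"]].describe()

print(type(toy_summary))
print(toy_summary[("time_spent_sleeping_hours", "mean")])
```

Each row of the result corresponds to one group, and a single statistic can be pulled out with a `(column, statistic)` pair as shown in the last line.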

In [29]:
# check the result, which should be a DataFrame.

print(f'The type of sleep_summary_byreduce is {type(sleep_summary_byreduce)}')

sleep_summary_byreduce

The type of sleep_summary_byreduce is <class 'pandas.core.frame.DataFrame'>


Summary of `time_spent_sleeping_hours`, grouped by `reduce_time_sleeping`:

reduce_time_sleeping | count | mean | std | min | 25% | 50% | 75% | max
---------------------|-------|------|-----|-----|-----|-----|-----|----
1 | 6893.0 | 8.49765 | 2.246065 | 0.0 | 7.25 | 8.25 | 9.5 | 23.833333
2 | 10139.0 | 8.843398 | 2.175352 | 0.0 | 7.583333 | 8.666667 | 9.916667 | 24.0


## Visualize distributions

### Histogram

We're interested in understanding the distribution of hours slept by respondents. Visualizing a distribution can help us see how dispersed our data are around our mean and median estimates.

Use the `pandas` `hist` function to plot a histogram of `'time_spent_sleeping_hours'`, by `'reduce_time_sleeping'` and name this histogram `time_spent_sleeping_histogram`. 

The documentation for `pandas.DataFrame.hist` is [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html).  (HINT: you will need to specify the values for parameters `column` and `by` to produce side-by-side histograms. You should also set parameter `sharey = True`) __(1 mark)__

NB: In this section **do not** use `pandas.DataFrame.plot.hist`.
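
A minimal sketch of how the `column`, `by`, and `sharey` parameters of `DataFrame.hist` work together, on invented data (`toy` and `toy_axes` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "reduce_time_sleeping": [1, 1, 1, 2, 2, 2],
        "time_spent_sleeping_hours": [6.5, 7.0, 8.0, 8.0, 8.5, 9.5],
    }
)

# One panel per value of the `by` column; sharey=True gives the panels
# a common y-axis so bar heights can be compared directly.
toy_axes = toy.hist(
    column="time_spent_sleeping_hours",
    by="reduce_time_sleeping",
    sharey=True,
)

# Each panel is titled with the group it shows.
for ax in toy_axes.flat:
    print(ax.get_title())
```

With two groups, the call returns an array holding one matplotlib axes object per panel.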

### Compare distributions between groups with boxplot

Use `clean_time_use_data` to create side-by-side boxplots of `'time_spent_sleeping_hours'` by `'reduce_time_sleeping'`. Name this plot `hours_spent_sleeping_boxplot`. __(1 mark)__

NB: In this section **do not** use `pandas.DataFrame.plot.box`
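
For reference, `DataFrame.boxplot` accepts the same `column` and `by` parameters. The sketch below uses invented data, and `toy` and `toy_boxplot` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "reduce_time_sleeping": [1, 1, 1, 2, 2, 2],
        "time_spent_sleeping_hours": [6.5, 7.0, 8.0, 8.0, 8.5, 9.5],
    }
)

# One box per value of the `by` column, drawn side by side on a common scale.
toy_boxplot = toy.boxplot(
    column="time_spent_sleeping_hours",
    by="reduce_time_sleeping",
)
```

Each box spans the 25th to 75th percentile with a line at the median, which makes the group medians and spreads easy to compare at a glance.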

# Conclusion

Include cells with your answers to each of these questions:
 
1. In this assignment you have made three different comparisons of time spent sleeping between respondents who purposely reduce their sleeping time and those who do not. Which comparison is the most informative? Briefly explain. __(3 marks)__


2. Is there evidence of a difference in time spent sleeping between those who purposely reduce their sleeping time and those who do not? Briefly explain. __(3 marks)__