# GGR274 Lab 5: Data Transformations, Grouped Data, and Data Visualization

## Logistics

Like last week, our lab grade will be based on attendance and submission of a few small tasks to MarkUs during the lab session (or by 23:59 on Thursday).

Complete the tasks in this Jupyter notebook and submit your completed file to [MarkUs](https://markus.teach.cs.toronto.edu/markus/main/login_remote_auth).
Here are the instructions for submitting to MarkUs (same as last week):

1. Download this file (`Lab_5.ipynb`) from JupyterHub. (See [our JupyterHub Guide](../../../guides/jupyterhub_guide.ipynb) for detailed instructions.)
2. Submit this file to MarkUs under the **lab5** assignment. (See [our MarkUs Guide](../../../guides/markus_guide.ipynb) for detailed instructions.)

Note: Use the autotests with this week's lab to check whether you are on the right track. It's important to follow the steps so that your answers match the solution not only in how they appear on screen, but also in data types, whitespace, rounding, etc.

## Lab 5 Introduction

In this lab and in the homework, we will continue working with the PUMF Census data. Because the full data set is so large, you will be working with a subset of the data in the file named `"pumf_age_employment.csv"`.

The goal today is to create a new column that labels each row with a broader age-range category, and then use these categories to look at employment income across the categories.

Finally, we will draw some box plots and bar graphs to compare the data.

As usual, these labs are meant to facilitate your understanding of the material from lectures in a low-stakes environment. Please feel free to refer to your lecture content, collaborate with your peers, and seek out help from your TAs.

## Task 1

a) Read the CSV file `"pumf_age_employment.csv"` into a pandas `DataFrame` named `age_data`.
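As a reminder of the general pattern, here is a minimal sketch of `pd.read_csv` applied to a small, made-up CSV held in memory. The values and the names `toy_csv` and `toy_df` are hypothetical; in the lab you would pass the file name rather than an `io.StringIO` object.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
toy_csv = io.StringIO(
    "HH_ID,AGEGRP,EMPIN\n"
    "1,3,1200\n"
    "2,7,54000\n"
    "3,12,0\n"
)

# read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(toy_csv)
print(toy_df.shape)
```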

In [None]:
import pandas as pd

# write your solution below


# check your work
age_data.head()

b) To make it easier to see the results going forward, set `age_data` to contain only the columns we are interested in:  `HH_ID`, `AGEGRP`, and `EMPIN`.
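One common way to keep only certain columns is to index the `DataFrame` with a list of column names. The sketch below uses a made-up `toy_df` with a hypothetical extra column named `OTHER`; it only illustrates the pattern.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "HH_ID": [1, 2, 3],
    "AGEGRP": [3, 7, 12],
    "EMPIN": [1200, 54000, 0],
    "OTHER": ["a", "b", "c"],  # hypothetical extra column
})

# Indexing with a list of names keeps those columns, in the order listed.
toy_df = toy_df[["HH_ID", "AGEGRP", "EMPIN"]]
print(toy_df.columns.tolist())
```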

In [None]:
# write your solution below

# check your work
age_data.head()

## Task 2


a) Create a new column in `age_data` named `"age_bin"`.  The values of `"age_bin"` should be obtained from the `"AGEGRP"` column in `age_data` which has the values:

| Code | Description         |
|------|---------------------|
| 1    | 0 to 9 years        |
| 2    | 10 to 14 years      |
| 3    | 15 to 19 years      |
| 4    | 20 to 24 years      |
| 5    | 25 to 29 years      |
| 6    | 30 to 34 years      |
| 7    | 35 to 39 years      |
| 8    | 40 to 44 years      |
| 9    | 45 to 49 years      |
| 10   | 50 to 54 years      |
| 11   | 55 to 64 years      |
| 12   | 65 to 74 years      |
| 13   | 75 years and over   |
| 88   | Not available       |

Note that there are no entries where AGEGRP has codes 1 or 2. 

`"age_bin"` should have the values `"youth"`, `"young adult"`, `"middle-aged"`, and `"senior"`, defined as:

- `"youth"`: ages 15-24
- `"young adult"`: ages 25-44
- `"middle-aged"`: ages 45-64
- `"senior"`: ages 65+

Using boolean conditions, identify which rows correspond to youth, young adults, middle-aged, and seniors. Then use `.loc` to assign these descriptive age categories to a new column called `age_bin`. We recommend creating one boolean Series for each age category and applying each one in turn to the column. You may want to review boolean Series from week 4.

Recall from class that `df.loc[row_selector, column] = value` assigns `value` to the column `column` in every row where the boolean Series `row_selector` is `True`.
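The pattern can be sketched on unrelated toy data. The `score` and `level` columns and the cut-offs below are hypothetical and chosen only to show one boolean Series per category followed by one `.loc` assignment each.

```python
import pandas as pd

toy_df = pd.DataFrame({"score": [35, 72, 90, 58]})  # hypothetical data

# One boolean Series per category.
is_low = toy_df["score"] < 50
is_mid = (toy_df["score"] >= 50) & (toy_df["score"] < 80)
is_high = toy_df["score"] >= 80

# Each assignment fills the new column only in rows where the Series is True.
toy_df.loc[is_low, "level"] = "low"
toy_df.loc[is_mid, "level"] = "medium"
toy_df.loc[is_high, "level"] = "high"
print(toy_df)
```

When categories are defined by a set of codes rather than a numeric range, `Series.isin([...])` is another way to build each boolean Series.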

In [None]:
import numpy as np

# write your code below






# Check your result by displaying the first 5 rows of the DataFrame.
# The columns of interest are AGEGRP and EMPIN.
age_data.head()

b) Compute the distribution of `age_bin` as counts using `.value_counts()`, and store the count distribution in `age_bin_count_dist`. Then compute the distribution of `age_bin` as proportions of the total population, and store this in `age_bin_prop_dist`.
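As a rough illustration on made-up data, `value_counts()` returns counts per category, and passing `normalize=True` returns proportions instead. The `colours` Series is hypothetical, and this sketch does not claim to be the exact approach the autotests expect; dividing the counts by their total gives the same proportions.

```python
import pandas as pd

# Hypothetical categorical data.
colours = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# Counts per category, largest first.
colour_counts = colours.value_counts()

# Proportions per category; equivalent to colour_counts / colour_counts.sum().
colour_props = colours.value_counts(normalize=True)

print(colour_counts)
print(colour_props)
```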

In [None]:
# write your code below


# check your work
age_bin_count_dist

In [None]:
# write your code below



# check your work
age_bin_prop_dist

c) Next, sort the values of `age_bin_prop_dist` in ascending order (smallest to largest) using the `sort_values` method with `ascending=True` and `inplace=True`, so that `age_bin_prop_dist` itself is updated.

> **(Not graded)** The `inplace=True` parameter in `sort_values` modifies `age_bin_prop_dist`. What do you predict would happen to `age_bin_prop_dist` if we used `age_bin_prop_dist.sort_values(ascending=True, inplace=False)` instead? 

In [None]:
# write your code below



age_bin_prop_dist   

> `age_bin_prop_dist.sort_values(ascending=True, inplace=False)` will return a `pd.Series` with the values sorted. However, unlike using `inplace=True`, it will not update the values stored in `age_bin_prop_dist`.
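The difference can be seen on a small, made-up Series. The name `props` and its values are hypothetical.

```python
import pandas as pd

props = pd.Series({"a": 0.5, "b": 0.2, "c": 0.3})  # hypothetical proportions

# inplace=False (the default) returns a new sorted Series; props is unchanged.
sorted_copy = props.sort_values(ascending=True, inplace=False)
order_before = props.index.tolist()

# inplace=True sorts props itself and returns None.
result = props.sort_values(ascending=True, inplace=True)
order_after = props.index.tolist()

print(order_before, order_after, result)
```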

d) **(Not graded)** Create a bar plot of `age_bin_prop_dist`. 

_Feel free to explore different aesthetic options by changing parameters for the plotting function. (See the documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html).)_
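A minimal sketch of a customized bar plot, assuming matplotlib is installed (pandas plotting relies on it). The data, colour, title, and axis labels are hypothetical choices for illustration only.

```python
import pandas as pd

# Hypothetical proportions, used only for illustration.
toy_props = pd.Series({"cat": 0.5, "dog": 0.3, "bird": 0.2})

# Series.plot.bar returns a matplotlib Axes that can be customized further.
ax = toy_props.plot.bar(color="steelblue", rot=0, title="Share of pets")
ax.set_xlabel("Pet")
ax.set_ylabel("Proportion")
```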

In [None]:
# Write your code below

## Task 3

a) We are going to look at employment income data, but first we have to convert the values that stand in for "not available" and "not applicable".

We will do that using the `.replace(to_replace, value)` method. For the first argument, we can pass in a single value or a list of values that we want to replace. The second argument is our new value, which in this case will be `np.nan`.
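Here is a small sketch of replacing placeholder codes with `np.nan`. The codes `8888` and `9999` and the `income` column are hypothetical; check the data documentation for the actual placeholder values used in `EMPIN`.

```python
import numpy as np
import pandas as pd

# 8888 and 9999 are hypothetical placeholder codes for this sketch.
toy_df = pd.DataFrame({"income": [42000, 9999, 15000, 8888]})

# Passing a list replaces every listed value with the same new value.
toy_df["income"] = toy_df["income"].replace([8888, 9999], np.nan)
print(toy_df)

# Missing values are skipped by default in summaries such as mean().
print(toy_df["income"].mean())
```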

In [None]:
import numpy as np
# solution


# check your work to make sure you see NaN in the EMPIN column
age_data.head()

b) Create a boxplot of employment income (`EMPIN`) by `age_bin` and store it in `income_by_age_boxplots` by completing the code below.

1. Use `figsize=(8, 8)` inside the `pandas.DataFrame.boxplot()` method.
2. Set the label on the x-axis to `Age Group` by using the `.set_xlabel()` method, as follows:
```python
income_by_age_boxplots.set_xlabel("Age Group")
```
3. Set the label on the y-axis to `Employment Income` by using the `.set_ylabel()` method, as follows:
```python
income_by_age_boxplots.set_ylabel("Employment Income")
```

In [None]:
# Solution

income_by_age_boxplots = age_data.boxplot(
    column=
    by=
    figsize=
)

# add the axis labels

# If the plot does not appear even though there is no error, try running the line below.
# income_by_age_boxplots.figure 


c) **(Not graded)** Feel free to customize a copy of the plot, `income_by_age_boxplots_copy`, to your liking with the help of the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html).

For example, you might want to order the categories from youngest to oldest (or oldest to youngest).

For this kind of customization, see the [documentation on `pandas.Categorical`](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html) for more information.
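One possible way to control category order is to convert the grouping column to an ordered categorical. The sketch below uses hypothetical `size` and `price` columns and shows that a grouped summary then follows the stated category order rather than alphabetical order; it is an illustration, not the required approach.

```python
import pandas as pd

# Hypothetical data for illustration.
toy_df = pd.DataFrame({
    "size": ["large", "small", "medium", "small", "large"],
    "price": [30, 10, 20, 12, 28],
})

# Declare the categories in the order we want them to appear.
toy_df["size"] = pd.Categorical(
    toy_df["size"],
    categories=["small", "medium", "large"],
    ordered=True,
)

# Grouped results are ordered by category, not alphabetically.
medians = toy_df.groupby("size", observed=True)["price"].median()
print(medians.index.tolist())
```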

In [None]:
# Write your code here
income_by_age_boxplots_copy = income_by_age_boxplots



# If the plot does not appear even though there is no error, try running the line below.
income_by_age_boxplots_copy.figure