# Tutorial 3: Cleaning and Wrangling Data

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

* define the term "tidy data"
* explain when chaining is appropriate and demonstrate chaining over multiple lines and verbs.
* discuss the advantages and disadvantages of storing data in a tidy data format
* recall and use the following functions and methods for their intended data wrangling tasks:
    - Use `loc[]` to select rows and columns.
    - Use `[]` to filter rows of a data frame.
    - Create new columns in a data frame using `.assign`.
    - Use `.groupby` to calculate summary statistics on grouped objects
    - Use `.melt` and `.pivot` to reshape data frames, specifically to make tidy data.
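
As a rough illustration of chaining over multiple lines, the sketch below strings several of these verbs together on a made-up data frame. The `fruit` data frame and its column names are hypothetical and are not part of this tutorial's data sets.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
fruit = pd.DataFrame({
    "shop": ["A", "A", "B", "B"],
    "year": [2017, 2018, 2017, 2018],
    "apples": [10, 12, 7, 9],
    "pears": [3, 4, 8, 6],
})

summary = (
    fruit[fruit["year"] == 2018]                          # `[]` with a boolean mask filters rows
    .assign(total=lambda df: df["apples"] + df["pears"])  # `.assign` adds a new column
    .loc[:, ["shop", "total"]]                            # `loc[]` selects rows and columns
    .groupby("shop")                                      # one group per shop
    .sum()                                                # summary statistic for each group
    .reset_index()                                        # turn the group labels back into a column
)
summary
```

Wrapping the chain in parentheses lets each step sit on its own line, so the steps read top to bottom in the order they are applied.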

This tutorial covers parts of [Chapter 3](https://python.datasciencebook.ca/wrangling.html) of the online textbook. You should read this chapter before attempting the tutorial.

Any place you see `___`, you must fill in the function, variable, or data to complete the code. Replace `raise NotImplementedError` with your completed code and answers, then run the cell!

In [None]:
### Run this cell before continuing.
import altair as alt
import pandas as pd

**Question 0.1** 
<br> {points: 1}

Match the following definitions with the corresponding functions used in Python:

A. Reads files that have columns separated by commas. 

B. Most data operations are done on groups defined by variables. This function takes an existing data set and converts it into a grouped data set where operations are performed "by group". 

C. "Lengthens" data, increasing the number of rows and decreasing the number of columns.

D. Creates a new column in a dataframe. 


*Functions*

1. `groupby`
2. `pd.read_csv`
3. `assign`
4. `melt`

*For every description, create an object using the letter associated with the definition and assign it to the corresponding number from the list of functions. For example:*
```
A = 1
B = 2
...
D = 4
```


In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(A)).encode("utf-8")+b"f2df8").hexdigest() == "feb770f45f38b27ef713059361778637b6ce2123", "type of A is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(A).encode("utf-8")+b"f2df8").hexdigest() == "3f9048cafc40b396bd93c46829bcfc0e56366694", "value of A is not correct"

assert sha1(str(type(B)).encode("utf-8")+b"f2df9").hexdigest() == "2d982c2d07d04148e21d4618fe10ac6fb4209cc3", "type of B is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(B).encode("utf-8")+b"f2df9").hexdigest() == "24e9f6238c911b0b268247026ba5a5decc7a72a8", "value of B is not correct"

assert sha1(str(type(C)).encode("utf-8")+b"f2dfa").hexdigest() == "575b399b3d23c774cbd5a59f7c9ef7a410ba17a5", "type of C is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(C).encode("utf-8")+b"f2dfa").hexdigest() == "4e596188a2769dce6b9dcd8f70c5d76454296747", "value of C is not correct"

assert sha1(str(type(D)).encode("utf-8")+b"f2dfb").hexdigest() == "989e2f469bdc3a9edb7403986036bfedff55ac11", "type of D is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(D).encode("utf-8")+b"f2dfb").hexdigest() == "594f77e079c41ec9961090edf3481ce7b647cd1c", "value of D is not correct"

print('Success!')

## 1. Historical Data on Avocado Prices 
In this tutorial, we will finish our analysis of the avocado data set.

You might recall from the lecture that millennials LOVE avocado toast. However, avocados are expensive, and this is costing millennials a lot more than you think (joking again 😉, well mostly...). To save enough to buy a house, an avocado fanatic would benefit from moving to a city with low avocado prices. From `worksheet_wrangling` we saw that avocado prices are lower in the months between December and May, but we still don't know which region has the cheapest avocados.

<img align="left" src="https://media.giphy.com/media/8p3ylHVA2ZOIo/giphy.gif" width="200"/>

*image source: https://media.giphy.com/media/8p3ylHVA2ZOIo/giphy.gif*


As a reminder, here are some relevant columns in the dataset:

- `average_price` - The average price of a single avocado.
- `type` - conventional or organic
- `year` - The year
- `region` - The city or region of the observation
- `small_hass_volume`	
- `large_hass_volume`	
- `extra_l_hass_volume`	

Additionally, the last three columns can be used to calculate `total_volume` in pounds (lbs). The goal for today is to find the region with the cheapest avocados and then produce a plot of the total volume of avocados sold against the average price per avocado **(in US dollars)** for that region. To do this, you will follow the steps below.

1. use a pandas `read_*` function to load the csv file into your notebook
2. use `groupby` to find the region with the cheapest avocados. 
3. use `loc[]` to specifically look at data from the region of interest. 
4. use `assign` to add up the volume for all types of avocados (small, large, and extra)
5. use `altair` to create our plot of volume vs average price


**Question 1.1** 
<br> {points: 1}

Read the file `avocado_prices.csv` found in the `data` directory using a relative path. 

*Assign your answer to an object called `avocado`.* 
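
A relative path is interpreted from the notebook's current working directory rather than from the root of the file system. The sketch below writes a tiny made-up CSV file into a temporary folder and reads it back through a relative path. The file name `toy_prices.csv` and its contents are hypothetical and only stand in for a real data file.

```python
import os
import tempfile

import pandas as pd

# Create a throwaway folder inside the current working directory.
with tempfile.TemporaryDirectory(dir=".") as tmp_dir:
    # Build a path that is relative to the working directory.
    rel_path = os.path.join(os.path.relpath(tmp_dir), "toy_prices.csv")

    # Write a tiny made-up CSV file (hypothetical contents).
    with open(rel_path, "w") as f:
        f.write("region,average_price\nNorth,1.25\nSouth,0.98\n")

    # pd.read_csv accepts the relative path directly.
    toy_prices = pd.read_csv(rel_path)

toy_prices
```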

In [None]:
# your code here
raise NotImplementedError
avocado

In [None]:
from hashlib import sha1
assert sha1(str(type(avocado is None)).encode("utf-8")+b"a5256").hexdigest() == "602065c59ad230d248d7ebc587290e557218bb92", "type of avocado is None is not bool. avocado is None should be a bool"
assert sha1(str(avocado is None).encode("utf-8")+b"a5256").hexdigest() == "15df45833ad752a8827d7486226b1d7d799925c7", "boolean value of avocado is None is not correct"

assert sha1(str(type(avocado)).encode("utf-8")+b"a5257").hexdigest() == "46d212a3929b18a30325a039fac4359d7d1ddbfa", "type of type(avocado) is not correct"

assert sha1(str(type(avocado.shape)).encode("utf-8")+b"a5258").hexdigest() == "c29182a253bfa86b9da62b25e7b59a1509ac6ed8", "type of avocado.shape is not tuple. avocado.shape should be a tuple"
assert sha1(str(len(avocado.shape)).encode("utf-8")+b"a5258").hexdigest() == "f373c4b2b507f01ea296ab31658106e5a718e0f1", "length of avocado.shape is not correct"
assert sha1(str(sorted(map(str, avocado.shape))).encode("utf-8")+b"a5258").hexdigest() == "71b8c8d7e8fb1e33515ffd57f6bd07d65092145f", "values of avocado.shape are not correct"
assert sha1(str(avocado.shape).encode("utf-8")+b"a5258").hexdigest() == "b01388920ea5d03ac6315491118ca57ca1a9384c", "order of elements of avocado.shape is not correct"

assert sha1(str(type(avocado.columns.values)).encode("utf-8")+b"a5259").hexdigest() == "12ea1fad15f2bc83dc87c3ac83042cd68fdb7804", "type of avocado.columns.values is not correct"
assert sha1(str(avocado.columns.values).encode("utf-8")+b"a5259").hexdigest() == "f3c35414d78ac81458c2e0130e1f4aacf883cef9", "value of avocado.columns.values is not correct"

print('Success!')

**Question 1.2** 
<br> {points: 1}

Now find the region with the cheapest avocados in 2018. To do this, calculate the average price for each region. Your answer should be the row from a data frame with the lowest average price. The data frame you create should have two columns, one named `region` that has the region, and the other that contains the average price for that region.

*Assign your answer to an object called `cheapest`.*
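
The filter, group, summarize, and sort pattern described above can be sketched on a small made-up data frame. The `sales` data and its column names are hypothetical; this shows the general pattern under those assumptions and is not the solution for the avocado data.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
sales = pd.DataFrame({
    "city": ["X", "X", "Y", "Y", "Z", "Z"],
    "year": [2018, 2018, 2018, 2018, 2017, 2018],
    "price": [2.0, 3.0, 1.0, 2.0, 0.5, 4.0],
})

lowest = (
    sales[sales["year"] == 2018]  # keep only the 2018 rows
    .groupby("city")["price"]     # group by city, keep the price column
    .mean()                       # average price per city
    .reset_index()                # make `city` a regular column again
    .sort_values(by="price")      # ascending, so the cheapest comes first
    .head(1)                      # keep only the first row
)
lowest
```

Selecting the numeric column before calling `.mean()` avoids averaging over text columns. In recent pandas versions, calling `.mean()` on a grouped data frame that still contains text columns raises an error unless `numeric_only=True` is passed.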



In [None]:
# cheapest = (
#     (
#         ___[___[___] == ___]
#         .groupby(___)
#         .___
#         .reset_index()
#         .sort_values(by=___)
#     )
#     .head(___)
#     [[___, ___]]
# )

# your code here
raise NotImplementedError
cheapest

In [None]:
from hashlib import sha1
assert sha1(str(type(cheapest is None)).encode("utf-8")+b"bcc54").hexdigest() == "0b2101c46782ac234a6fd78d5b81079af544cd47", "type of cheapest is None is not bool. cheapest is None should be a bool"
assert sha1(str(cheapest is None).encode("utf-8")+b"bcc54").hexdigest() == "5d011ac85feecf79a7acf602e0f6b1e5c2ae775e", "boolean value of cheapest is None is not correct"

assert sha1(str(type(cheapest)).encode("utf-8")+b"bcc55").hexdigest() == "9e91fb3321a2fa337da4a8de0babf06dcdbd38a0", "type of type(cheapest) is not correct"

assert sha1(str(type(cheapest.shape)).encode("utf-8")+b"bcc56").hexdigest() == "759f3295c59c5fa32dd2e1a22997e1766069c5fb", "type of cheapest.shape is not tuple. cheapest.shape should be a tuple"
assert sha1(str(len(cheapest.shape)).encode("utf-8")+b"bcc56").hexdigest() == "48ea058d0e05af51fa498654920711a9ff80f1ee", "length of cheapest.shape is not correct"
assert sha1(str(sorted(map(str, cheapest.shape))).encode("utf-8")+b"bcc56").hexdigest() == "6edf45b70ea3144aadce518a757c9513e3a48b04", "values of cheapest.shape are not correct"
assert sha1(str(cheapest.shape).encode("utf-8")+b"bcc56").hexdigest() == "b725ac24e913ad1fd5f40348886f3569fe1423ee", "order of elements of cheapest.shape is not correct"

assert sha1(str(type(cheapest.region.iloc[0])).encode("utf-8")+b"bcc57").hexdigest() == "5e63c4a566db3cb892ca7c96e3e74e118ae31d96", "type of cheapest.region.iloc[0] is not str. cheapest.region.iloc[0] should be an str"
assert sha1(str(len(cheapest.region.iloc[0])).encode("utf-8")+b"bcc57").hexdigest() == "4ec6fee13ec1d134d1f3311aa073703805dc6a75", "length of cheapest.region.iloc[0] is not correct"
assert sha1(str(cheapest.region.iloc[0].lower()).encode("utf-8")+b"bcc57").hexdigest() == "1fc9bd33fc1ff57d15a7aaceb112794f805b351b", "value of cheapest.region.iloc[0] is not correct"
assert sha1(str(cheapest.region.iloc[0]).encode("utf-8")+b"bcc57").hexdigest() == "4b1470da9383f0150384c30217841513424e2a04", "correct string value of cheapest.region.iloc[0] but incorrect case of letters"

assert sha1(str(type(cheapest.drop(columns="region").values.astype(int)[0][0])).encode("utf-8")+b"bcc58").hexdigest() == "1a2679d1a57da2cf2dedc2f39d13b0e2686f85da", "type of cheapest.drop(columns=\"region\").values.astype(int)[0][0] is not correct"
assert sha1(str(cheapest.drop(columns="region").values.astype(int)[0][0]).encode("utf-8")+b"bcc58").hexdigest() == "a1d21c23d43446410909d7c0da527e8157fd03f4", "value of cheapest.drop(columns=\"region\").values.astype(int)[0][0] is not correct"

print('Success!')

**Question 1.3**
<br> {points: 1}

Now we will plot the total volume against average price for the cheapest region across all years. First, add a column named `total_volume` to the data frame that is the sum of the three volume columns. Next, filter the data set to the cheapest region found in **Question 1.2**. You will then have the data needed to create a scatter plot with:

- x : `total_volume` 
- y : `average_price`

Fill in the `___` in the cell below. Replace `raise NotImplementedError` with your completed code and answers, then run the cell! We have added the code that converts the x-axis scale to logarithmic, which helps us see both large and small differences in the same chart.

*Assign your answer to an object called `avocado_plot`.* 

> Hint: Do not forget units on your data visualization! Here the price is in US dollars (USD) and the volume in pounds (lbs).

In [None]:
# avocado = avocado.assign(___=___ + ___ + ___)

# alt.Chart(___[___[___] == ___]).mark_point().encode(
#     x=alt.X(___)
#         .title(___)
#         .scale(type="log"),
#     y=alt.Y(___).title(___),
# )

# your code here
raise NotImplementedError
avocado_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(avocado_plot is None)).encode("utf-8")+b"3ec89").hexdigest() == "47b2d3930eb8e28a4d9010c695d6bbbdba78dd2f", "type of avocado_plot is None is not bool. avocado_plot is None should be a bool"
assert sha1(str(avocado_plot is None).encode("utf-8")+b"3ec89").hexdigest() == "b4e9a553258148b0e25a3efee3667a4e2c68a2df", "boolean value of avocado_plot is None is not correct"

assert sha1(str(type(avocado_plot.encoding.x['shorthand'])).encode("utf-8")+b"3ec8a").hexdigest() == "ca2896ee04e763df03aad95c2f0221316f54fd0e", "type of avocado_plot.encoding.x['shorthand'] is not str. avocado_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(avocado_plot.encoding.x['shorthand'])).encode("utf-8")+b"3ec8a").hexdigest() == "1e4cf40b6539b44a5982346eab911707a6892764", "length of avocado_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(avocado_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"3ec8a").hexdigest() == "db1bbb82c38ed4abed72a4dda6a8fadc40db5927", "value of avocado_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(avocado_plot.encoding.x['shorthand']).encode("utf-8")+b"3ec8a").hexdigest() == "db1bbb82c38ed4abed72a4dda6a8fadc40db5927", "correct string value of avocado_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(avocado_plot.encoding.y['shorthand'])).encode("utf-8")+b"3ec8b").hexdigest() == "b9ba846d88eddcea2fbb935dc127d36722864704", "type of avocado_plot.encoding.y['shorthand'] is not str. avocado_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(avocado_plot.encoding.y['shorthand'])).encode("utf-8")+b"3ec8b").hexdigest() == "7452b21004a4f77dd4128967059036d8b57510ed", "length of avocado_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(avocado_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"3ec8b").hexdigest() == "c7fe38fa9784f2a71c1967e43edec317a65e21d5", "value of avocado_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(avocado_plot.encoding.y['shorthand']).encode("utf-8")+b"3ec8b").hexdigest() == "c7fe38fa9784f2a71c1967e43edec317a65e21d5", "correct string value of avocado_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(avocado_plot.data.region.unique()[0])).encode("utf-8")+b"3ec8c").hexdigest() == "cfb1be84f7cd1626e6b284f017b53dcc48da185e", "type of avocado_plot.data.region.unique()[0] is not str. avocado_plot.data.region.unique()[0] should be an str"
assert sha1(str(len(avocado_plot.data.region.unique()[0])).encode("utf-8")+b"3ec8c").hexdigest() == "fc2009810cf27d6bcbd523880b86db0c3cd7f70f", "length of avocado_plot.data.region.unique()[0] is not correct"
assert sha1(str(avocado_plot.data.region.unique()[0].lower()).encode("utf-8")+b"3ec8c").hexdigest() == "5ba20c6569980f92e3cdf6352de6c54bcebca728", "value of avocado_plot.data.region.unique()[0] is not correct"
assert sha1(str(avocado_plot.data.region.unique()[0]).encode("utf-8")+b"3ec8c").hexdigest() == "70625a1f4c33056cc158cec56d87f2c2aa90e0de", "correct string value of avocado_plot.data.region.unique()[0] but incorrect case of letters"

assert sha1(str(type(avocado_plot.mark)).encode("utf-8")+b"3ec8d").hexdigest() == "985a6d0f8d2da7f677d55a3aa6f74dbcc4c45c9f", "type of avocado_plot.mark is not str. avocado_plot.mark should be an str"
assert sha1(str(len(avocado_plot.mark)).encode("utf-8")+b"3ec8d").hexdigest() == "46f18521ba0a4ffe9bb9104f09a696961e73fa53", "length of avocado_plot.mark is not correct"
assert sha1(str(avocado_plot.mark.lower()).encode("utf-8")+b"3ec8d").hexdigest() == "2d29391970f52f44dbd1bc64ce80318ade87cf02", "value of avocado_plot.mark is not correct"
assert sha1(str(avocado_plot.mark).encode("utf-8")+b"3ec8d").hexdigest() == "2d29391970f52f44dbd1bc64ce80318ade87cf02", "correct string value of avocado_plot.mark but incorrect case of letters"

assert sha1(str(type(isinstance(avocado_plot.encoding.x['title'], str))).encode("utf-8")+b"3ec8e").hexdigest() == "a2091e0276cb7ce1724e01a04c882cc18a649e64", "type of isinstance(avocado_plot.encoding.x['title'], str) is not bool. isinstance(avocado_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(avocado_plot.encoding.x['title'], str)).encode("utf-8")+b"3ec8e").hexdigest() == "d51bead1d49b620642c62e11320984796b451376", "boolean value of isinstance(avocado_plot.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(avocado_plot.encoding.y['title'], str))).encode("utf-8")+b"3ec8f").hexdigest() == "586be4f0d4a722610db51d9ef7ba1e747a94e56f", "type of isinstance(avocado_plot.encoding.y['title'], str) is not bool. isinstance(avocado_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(avocado_plot.encoding.y['title'], str)).encode("utf-8")+b"3ec8f").hexdigest() == "4a048fe7a3b26fc9ffa76720f882c843c0eedf5d", "boolean value of isinstance(avocado_plot.encoding.y['title'], str) is not correct"

print('Success!')

**Question 1.4** 

What do you notice? Discuss your plot with the person next to you.

To further investigate this trend, let's colour the data points to see if the type of avocado (either organic or not, which is called conventional in this data set) affects the volume and price of avocados sold in our region of interest. 

Run the cell below to colour the data points by avocado type. 

In [None]:
# Run this cell to see if avocado type (the type variable) plays a role in production and price.

avocado_plot = avocado_plot.encode(color="type")

avocado_plot

**Question 1.4 (Continued)**
<br> {points: 3}

In 2-3 sentences, describe what you see in the graph above. Comment specifically on whether there is any evidence/indication that avocado type might influence price.

*Hint: Make sure to include information about volume, average price, and avocado type in your answer.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

## 2. Historical Data on Avocado Prices (Continued)
**Question 2.1**
<br> {points: 3}

Now that we know the region that sold the cheapest avocados (on average) in 2018, which region sold the most expensive avocados (on average) in 2018? And for that region, what role might avocado type play in sales? Repeat the analysis you did above, but now apply it to investigate the region which sold the most expensive avocados (on average) in 2018. 

Remember: we are finding the region that sold the most expensive avocados *in 2018*, but then producing a scatter plot of average price versus total volume sold *for all years*.

*Name your plot object `priciest_plot`.*

> Hint: We recommend you create the data frame `priciest` first and look at that before building the plot. You can always add a new cell with the plus-button. 
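
Finding the largest value instead of the smallest only requires reversing the sort order. The sketch below shows two equivalent ways to do this on a made-up table of per-city averages. The `means` data frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical per-city averages, invented purely for illustration.
means = pd.DataFrame({
    "city": ["X", "Y", "Z"],
    "price": [2.5, 1.5, 4.0],
})

# Option 1: sort in descending order and keep the first row.
highest = means.sort_values(by="price", ascending=False).head(1)

# Option 2: ask directly for the row with the largest price.
same_row = means.nlargest(1, "price")

highest
```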

In [None]:
# your code here
raise NotImplementedError
priciest_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(priciest_plot is None)).encode("utf-8")+b"27f0f").hexdigest() == "07e559e7f971d3d1c902604248988890d61d09f6", "type of priciest_plot is None is not bool. priciest_plot is None should be a bool"
assert sha1(str(priciest_plot is None).encode("utf-8")+b"27f0f").hexdigest() == "c18557b81394566d6e7132a883f27bb8dc249812", "boolean value of priciest_plot is None is not correct"

assert sha1(str(type(priciest_plot.mark)).encode("utf-8")+b"27f10").hexdigest() == "15142194e0354e1f4191589e139aaab9370f5b0b", "type of priciest_plot.mark is not str. priciest_plot.mark should be an str"
assert sha1(str(len(priciest_plot.mark)).encode("utf-8")+b"27f10").hexdigest() == "747058c49d20dc54af081866fcb50d474c783fc9", "length of priciest_plot.mark is not correct"
assert sha1(str(priciest_plot.mark.lower()).encode("utf-8")+b"27f10").hexdigest() == "c33a97289aac63ff3ce013448c17eb4904183fc5", "value of priciest_plot.mark is not correct"
assert sha1(str(priciest_plot.mark).encode("utf-8")+b"27f10").hexdigest() == "c33a97289aac63ff3ce013448c17eb4904183fc5", "correct string value of priciest_plot.mark but incorrect case of letters"

assert sha1(str(type(priciest_plot.data.region.unique()[0])).encode("utf-8")+b"27f11").hexdigest() == "27355170be251993469ddd23e79965e0154feab6", "type of priciest_plot.data.region.unique()[0] is not str. priciest_plot.data.region.unique()[0] should be an str"
assert sha1(str(len(priciest_plot.data.region.unique()[0])).encode("utf-8")+b"27f11").hexdigest() == "beba478b327c88718c5686a32b2a328cd8b61f9a", "length of priciest_plot.data.region.unique()[0] is not correct"
assert sha1(str(priciest_plot.data.region.unique()[0].lower()).encode("utf-8")+b"27f11").hexdigest() == "66ae8437a7f614d6f7ac8c396a5a0d0cc0eead27", "value of priciest_plot.data.region.unique()[0] is not correct"
assert sha1(str(priciest_plot.data.region.unique()[0]).encode("utf-8")+b"27f11").hexdigest() == "5f125bb94a98164bf840bc9f116fc081f189f79f", "correct string value of priciest_plot.data.region.unique()[0] but incorrect case of letters"

assert sha1(str(type(priciest_plot.data.shape[0])).encode("utf-8")+b"27f12").hexdigest() == "c2e6b1d4e0e33b0fa70587f24594c1b1babb277f", "type of priciest_plot.data.shape[0] is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(priciest_plot.data.shape[0]).encode("utf-8")+b"27f12").hexdigest() == "281a6ff9d504016e5c878de7603ccf4db3acecd5", "value of priciest_plot.data.shape[0] is not correct"

assert sha1(str(type(priciest_plot.encoding.x['shorthand'])).encode("utf-8")+b"27f13").hexdigest() == "7651c17031d669a4cb11f36ab7acf5695e3e5c78", "type of priciest_plot.encoding.x['shorthand'] is not str. priciest_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(priciest_plot.encoding.x['shorthand'])).encode("utf-8")+b"27f13").hexdigest() == "d782822f3dce7a504bda7dbc1b1de5e38f6d3f7c", "length of priciest_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(priciest_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"27f13").hexdigest() == "ccc131294125289c7763e0dca53c883e935e03b8", "value of priciest_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(priciest_plot.encoding.x['shorthand']).encode("utf-8")+b"27f13").hexdigest() == "ccc131294125289c7763e0dca53c883e935e03b8", "correct string value of priciest_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(priciest_plot.encoding.y['shorthand'])).encode("utf-8")+b"27f14").hexdigest() == "7c02f4a2858afacfc4a5137a6ff69c8be099909e", "type of priciest_plot.encoding.y['shorthand'] is not str. priciest_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(priciest_plot.encoding.y['shorthand'])).encode("utf-8")+b"27f14").hexdigest() == "453c4ac20b89f3e6f4ee7042101b3b75ddf371f3", "length of priciest_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(priciest_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"27f14").hexdigest() == "6b791e9ca21db7cb704302e04d2ed5c81506b745", "value of priciest_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(priciest_plot.encoding.y['shorthand']).encode("utf-8")+b"27f14").hexdigest() == "6b791e9ca21db7cb704302e04d2ed5c81506b745", "correct string value of priciest_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(priciest_plot.encoding.color['shorthand'])).encode("utf-8")+b"27f15").hexdigest() == "98dc9137aea3aa82bf161065dcfdfb6a19fdd2c1", "type of priciest_plot.encoding.color['shorthand'] is not str. priciest_plot.encoding.color['shorthand'] should be an str"
assert sha1(str(len(priciest_plot.encoding.color['shorthand'])).encode("utf-8")+b"27f15").hexdigest() == "0709ffbf76fb7632b0c0e83db7b741288491358b", "length of priciest_plot.encoding.color['shorthand'] is not correct"
assert sha1(str(priciest_plot.encoding.color['shorthand'].lower()).encode("utf-8")+b"27f15").hexdigest() == "ae0f004809f64103cddb8a9d581fdb6c33add675", "value of priciest_plot.encoding.color['shorthand'] is not correct"
assert sha1(str(priciest_plot.encoding.color['shorthand']).encode("utf-8")+b"27f15").hexdigest() == "ae0f004809f64103cddb8a9d581fdb6c33add675", "correct string value of priciest_plot.encoding.color['shorthand'] but incorrect case of letters"

assert sha1(str(type(isinstance(priciest_plot.encoding.x['title'], str))).encode("utf-8")+b"27f16").hexdigest() == "e9f190e818872165df47d3c24b51df4525462e89", "type of isinstance(priciest_plot.encoding.x['title'], str) is not bool. isinstance(priciest_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(priciest_plot.encoding.x['title'], str)).encode("utf-8")+b"27f16").hexdigest() == "b0693812c80f36d6a2012338ef371ed4d431e0a0", "boolean value of isinstance(priciest_plot.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(priciest_plot.encoding.y['title'], str))).encode("utf-8")+b"27f17").hexdigest() == "380e65c20abacdcf8970929498e6dd7022970eff", "type of isinstance(priciest_plot.encoding.y['title'], str) is not bool. isinstance(priciest_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(priciest_plot.encoding.y['title'], str)).encode("utf-8")+b"27f17").hexdigest() == "88603126759c8520abbd46356475b963bb50a10e", "boolean value of isinstance(priciest_plot.encoding.y['title'], str) is not correct"

print('Success!')

**Question 2.2**
<br> {points: 3}

In 2-3 sentences, describe what you see in the graph above for the region with the most expensive avocados (on average). Comment specifically on whether there is any evidence/indication that avocado type might influence price.

*Hint: Make sure to include information about volume, average price, and avocado type in your answer.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 2.3**
<br> {points: 3}

Add new code cells in JupyterLab and plot the scatterplots for the two regions so that they are in adjacent cells (so it is easier for you to compare them). Compare the price and volume data across the two regions. Then argue for or against the following hypothesis:

"*the region that has the cheapest avocados has them because it sells less of the organic (expensive) type of avocados compared to conventional cheaper ones.*"

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

## 3. Sea Surface Salinity in Departure Bay
As mentioned in this week's Worksheet, Canada's Department of Fisheries and Oceans (DFO) compiled environmentally essential data from 1914 to 2018. The data was collected at the Pacific Biological Station (Departure Bay). Daily sea surface temperature (degrees Celsius) and salinity (practical salinity units, PSU) observations have been carried out at several locations on the coast of British Columbia. The number of stations reporting at any given time has varied as sampling has been discontinued at some stations, and started or resumed at others.

In **`worksheet_wrangling`** we already worked with the temperature observations. Now, we will be focusing on salinity! Specifically, we want to see if the monthly maximum salinity has been changing over the years. We will only be focusing our attention on the winter months December, January and February. 

**Question 3.1**
<br> {points: 1}

To begin working with this data, read the file `max_salinity.csv` into Python. Note, this file (just like the avocado data set) is found within the `data` folder. 

*Assign your answer to an object called `sea_surface`.* 

In [None]:
# your code here
raise NotImplementedError
sea_surface

In [None]:
from hashlib import sha1
assert sha1(str(type(sea_surface is None)).encode("utf-8")+b"38a14").hexdigest() == "f3cd6fb08fe85eda413170037c1f3ed3395ff180", "type of sea_surface is None is not bool. sea_surface is None should be a bool"
assert sha1(str(sea_surface is None).encode("utf-8")+b"38a14").hexdigest() == "555eab9c17fb58a2a83f6eea6edf61fbc6a1e171", "boolean value of sea_surface is None is not correct"

assert sha1(str(type(sea_surface)).encode("utf-8")+b"38a15").hexdigest() == "7ff99845868f5e2bd5a1416c84d906d60d38fd5a", "type of type(sea_surface) is not correct"

assert sha1(str(type(sea_surface.shape)).encode("utf-8")+b"38a16").hexdigest() == "a01a241ec3361b0b05027fd3601526413a55e0db", "type of sea_surface.shape is not tuple. sea_surface.shape should be a tuple"
assert sha1(str(len(sea_surface.shape)).encode("utf-8")+b"38a16").hexdigest() == "bca2c3da7d211eda6554e6b030697103cb2cb8c0", "length of sea_surface.shape is not correct"
assert sha1(str(sorted(map(str, sea_surface.shape))).encode("utf-8")+b"38a16").hexdigest() == "9887a193ad7f0c4efb45cab990d0187b5846e26c", "values of sea_surface.shape are not correct"
assert sha1(str(sea_surface.shape).encode("utf-8")+b"38a16").hexdigest() == "a1e2b46a267a5adeede3f8f2183d9a5e9daf7da0", "order of elements of sea_surface.shape is not correct"

assert sha1(str(type(sea_surface.columns.values)).encode("utf-8")+b"38a17").hexdigest() == "79e36bc02fe8039c2876724a40c47bb2f37aaef5", "type of sea_surface.columns.values is not correct"
assert sha1(str(sea_surface.columns.values).encode("utf-8")+b"38a17").hexdigest() == "a2c2bfd5a9087608e55cbe2340ba6d8ce24216b3", "value of sea_surface.columns.values is not correct"

print('Success!')

**Question 3.2**
<br> {points: 3}

Given that `altair` prefers tidy data, we must tidy the data! Use the `melt` function to create a tidy data frame with three columns: `Year`, `Month` and `Salinity`. Remember we only want to look at the winter months (December, January and February) so don't forget to reduce the data to just those three!

*Assign your answer to an object called `max_salinity`.*
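
As a sketch of what `melt` does, the example below reshapes a small made-up wide table into a long (tidy) one and then uses `pivot` to reverse the operation. The `wide` data frame, its column names, and its values are hypothetical and are unrelated to the salinity file.

```python
import pandas as pd

# Hypothetical wide table: one row per year, one column per month.
wide = pd.DataFrame({
    "Year": [2000, 2001],
    "Jan": [1.0, 2.0],
    "Feb": [3.0, 4.0],
    "Mar": [5.0, 6.0],
})

# Keep only the columns of interest, then lengthen the table:
# each (Year, month) pair becomes its own row.
long_df = wide[["Year", "Jan", "Feb"]].melt(
    id_vars="Year",        # column(s) to keep as identifiers
    var_name="Month",      # new column holding the old column names
    value_name="Reading",  # new column holding the old cell values
)

# `pivot` goes the other way, spreading Month back into columns.
back_to_wide = long_df.pivot(
    index="Year", columns="Month", values="Reading"
).reset_index()

long_df
```

Here the two years and two selected months give four rows in `long_df`, one per observation, which is the shape that plotting libraries such as `altair` work with most easily.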

In [None]:
# ___ = sea_surface[[___, ___, ___, ___]].melt(
#     id_vars="Year",
#     var_name=___,
#     value_name=___
# )

# your code here
raise NotImplementedError
max_salinity

In [None]:
from hashlib import sha1
assert sha1(str(type(max_salinity is None)).encode("utf-8")+b"1da19").hexdigest() == "cec98c4e976287d0e5c386e7067b03fba5bc455b", "type of max_salinity is None is not bool. max_salinity is None should be a bool"
assert sha1(str(max_salinity is None).encode("utf-8")+b"1da19").hexdigest() == "f574cbf2b3bee981b021e3e2f8a2db7daf25c1ba", "boolean value of max_salinity is None is not correct"

assert sha1(str(type(max_salinity.shape)).encode("utf-8")+b"1da1a").hexdigest() == "27b9a6c300c778b66e6ba8b6055a78e130feb159", "type of max_salinity.shape is not tuple. max_salinity.shape should be a tuple"
assert sha1(str(len(max_salinity.shape)).encode("utf-8")+b"1da1a").hexdigest() == "aaedc5fbc4d5384bc033d3f68a0c72039ac6e550", "length of max_salinity.shape is not correct"
assert sha1(str(sorted(map(str, max_salinity.shape))).encode("utf-8")+b"1da1a").hexdigest() == "5e93096aae1b815903d12620171bac174d3cc613", "values of max_salinity.shape are not correct"
assert sha1(str(max_salinity.shape).encode("utf-8")+b"1da1a").hexdigest() == "ea5d9d53b5c7661edc09fa819ecb594b0f8ce8c7", "order of elements of max_salinity.shape is not correct"

assert sha1(str(type("".join(sorted(max_salinity.columns.values.tolist())))).encode("utf-8")+b"1da1b").hexdigest() == "0f38ae998c4a7b81468055156a74234e98496977", "type of \"\".join(sorted(max_salinity.columns.values.tolist())) is not str. \"\".join(sorted(max_salinity.columns.values.tolist())) should be an str"
assert sha1(str(len("".join(sorted(max_salinity.columns.values.tolist())))).encode("utf-8")+b"1da1b").hexdigest() == "c58b998dbba62eb798b8d9a085e20dcde54d1f79", "length of \"\".join(sorted(max_salinity.columns.values.tolist())) is not correct"
assert sha1(str("".join(sorted(max_salinity.columns.values.tolist())).lower()).encode("utf-8")+b"1da1b").hexdigest() == "a4782226d609f70a6d6ee74f6db806477731be73", "value of \"\".join(sorted(max_salinity.columns.values.tolist())) is not correct"
assert sha1(str("".join(sorted(max_salinity.columns.values.tolist()))).encode("utf-8")+b"1da1b").hexdigest() == "d3a1cde3ab8e39552daebac5403031d7f5799f11", "correct string value of \"\".join(sorted(max_salinity.columns.values.tolist())) but incorrect case of letters"

assert sha1(str(type(sum(max_salinity.Year))).encode("utf-8")+b"1da1c").hexdigest() == "20c07103b707a240293fc99f2ba6f58d84fcdcce", "type of sum(max_salinity.Year) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(max_salinity.Year)).encode("utf-8")+b"1da1c").hexdigest() == "264850ccdd4e9a6aaf4dac4c65e3c9c9097a3503", "value of sum(max_salinity.Year) is not correct"

print('Success!')

**Question 3.3** 
<br> {points: 3}

Now that we've created new columns, we can finally create our plot that compares the maximum salinity observations to the year they were recorded.

*Assign your answer to an object called `max_salinity_plot`.*

> Hint: do not forget to add units to your axes titles where appropriate! Remember from the data description that salinity is measured in practical salinity units (PSU).

In [None]:
# your code here
raise NotImplementedError
max_salinity_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(max_salinity_plot is None)).encode("utf-8")+b"befd8").hexdigest() == "b9853833e8540bcf8bbf1e239a19de1e7cd5370f", "type of max_salinity_plot is None is not bool. max_salinity_plot is None should be a bool"
assert sha1(str(max_salinity_plot is None).encode("utf-8")+b"befd8").hexdigest() == "fe888fb9fab656929f5431cb88e2530aaabe2809", "boolean value of max_salinity_plot is None is not correct"

assert sha1(str(type(max_salinity_plot.mark)).encode("utf-8")+b"befd9").hexdigest() == "6b281161b9c62861e2d3dfe20889b44311b82393", "type of max_salinity_plot.mark is not str. max_salinity_plot.mark should be an str"
assert sha1(str(len(max_salinity_plot.mark)).encode("utf-8")+b"befd9").hexdigest() == "42c4758ce8d3007ad9b244b43dc75f4148d3f2d8", "length of max_salinity_plot.mark is not correct"
assert sha1(str(max_salinity_plot.mark.lower()).encode("utf-8")+b"befd9").hexdigest() == "4c355e547bd5edbd78fb8331a3afdd7f7e1145aa", "value of max_salinity_plot.mark is not correct"
assert sha1(str(max_salinity_plot.mark).encode("utf-8")+b"befd9").hexdigest() == "4c355e547bd5edbd78fb8331a3afdd7f7e1145aa", "correct string value of max_salinity_plot.mark but incorrect case of letters"

assert sha1(str(type(max_salinity_plot.data.shape)).encode("utf-8")+b"befda").hexdigest() == "181edc34c1e70ecef01bf7a01d263c4e8c9bc44e", "type of max_salinity_plot.data.shape is not tuple. max_salinity_plot.data.shape should be a tuple"
assert sha1(str(len(max_salinity_plot.data.shape)).encode("utf-8")+b"befda").hexdigest() == "6e5350caa514af4fc9515998d63e73325117a7de", "length of max_salinity_plot.data.shape is not correct"
assert sha1(str(sorted(map(str, max_salinity_plot.data.shape))).encode("utf-8")+b"befda").hexdigest() == "7afb712e40780891ca1a949566e7233181e1e799", "values of max_salinity_plot.data.shape are not correct"
assert sha1(str(max_salinity_plot.data.shape).encode("utf-8")+b"befda").hexdigest() == "1a23d8e9265f73f33ecc1acf45d6bd2f47534293", "order of elements of max_salinity_plot.data.shape is not correct"

assert sha1(str(type(max_salinity_plot.encoding.x['shorthand'])).encode("utf-8")+b"befdb").hexdigest() == "6ec3231559104f83cb3736e4afa25b0300b77cae", "type of max_salinity_plot.encoding.x['shorthand'] is not str. max_salinity_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(max_salinity_plot.encoding.x['shorthand'])).encode("utf-8")+b"befdb").hexdigest() == "e3a59fbcee1058ebf66210d0df8a02d09f71f888", "length of max_salinity_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(max_salinity_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"befdb").hexdigest() == "232d6f24c0fb32770a242e1f89f34956a9b15898", "value of max_salinity_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(max_salinity_plot.encoding.x['shorthand']).encode("utf-8")+b"befdb").hexdigest() == "46487231f2436131458db64b9b1b89fac8f22029", "correct string value of max_salinity_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(max_salinity_plot.encoding.y['shorthand'])).encode("utf-8")+b"befdc").hexdigest() == "f3dba1be993b59504ce02a8659f0c76f4fb6d95b", "type of max_salinity_plot.encoding.y['shorthand'] is not str. max_salinity_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(max_salinity_plot.encoding.y['shorthand'])).encode("utf-8")+b"befdc").hexdigest() == "bcd3d3de30d5debf26ed2f8a5173e8db3614a051", "length of max_salinity_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(max_salinity_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"befdc").hexdigest() == "63755f32a5fa3b37596d8ba843fd8f6aced0ace3", "value of max_salinity_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(max_salinity_plot.encoding.y['shorthand']).encode("utf-8")+b"befdc").hexdigest() == "755ef7d5050ecd81414525e02b8e68e60333a2dd", "correct string value of max_salinity_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(isinstance(max_salinity_plot.encoding.x['title'], str))).encode("utf-8")+b"befdd").hexdigest() == "45976a7978bda3bf08dd6f0cdb89f5b0c9248072", "type of isinstance(max_salinity_plot.encoding.x['title'], str) is not bool. isinstance(max_salinity_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(max_salinity_plot.encoding.x['title'], str)).encode("utf-8")+b"befdd").hexdigest() == "7bdcf9565b6e9c51d68608845512826b142877f0", "boolean value of isinstance(max_salinity_plot.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(max_salinity_plot.encoding.y['title'], str))).encode("utf-8")+b"befde").hexdigest() == "6dedd38865004dc7a4548ddfa977183436aa7c2f", "type of isinstance(max_salinity_plot.encoding.y['title'], str) is not bool. isinstance(max_salinity_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(max_salinity_plot.encoding.y['title'], str)).encode("utf-8")+b"befde").hexdigest() == "c483d1e7a183fff74d5f75a16ea359ad90773ce9", "boolean value of isinstance(max_salinity_plot.encoding.y['title'], str) is not correct"

print('Success!')

**Question 3.4**
<br> {points: 3}

In 1-2 sentences, describe what you see in the graph above. Comment specifically on whether there is a change in salinity across time for the winter months and, if there is, whether this indicates a positive or a negative relationship for these variables within this data set. If there is a relationship, also comment on its strength and linearity.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

## 4. Pollution in Madrid
The goal of this analysis (which we started in `worksheet_wrangling`) is to see whether pollutant levels are decreasing (that is, whether air quality is improving) and to determine which pollutant has decreased the most over the span of 5 years (2001 - 2006). In `worksheet_wrangling` we investigated what happened with the maximum values of each pollutant over time; now we will investigate the average values of each pollutant over time. To do this we will:

1. Calculate the average monthly value for each pollutant for each year. 
2. Create a scatter plot for the average monthly value for each month. Plot these values for each pollutant and each year so that a trend over time for each pollutant can be observed.
3. Determine which pollutant decreased the most between 2001 - 2006 when we look at the average instead of the maximum values.

**Question 4.1** 
<br> {points: 3}

To begin working with this data, read the file `madrid_pollution.csv`. Note that this file (just like the other data sets in this tutorial) is found in the `py_tutorial_wrangling` directory. 

*Assign your answer to an object called `madrid`.* 

In [None]:
# your code here
raise NotImplementedError
madrid

In [None]:
from hashlib import sha1
assert sha1(str(type(madrid is None)).encode("utf-8")+b"d3ea0").hexdigest() == "6428248670ef4c9e40decdf3d5fe878bd2f86d5a", "type of madrid is None is not bool. madrid is None should be a bool"
assert sha1(str(madrid is None).encode("utf-8")+b"d3ea0").hexdigest() == "aa73b6ead81bc52b5d6a70d4d67c8c349c5f7eb9", "boolean value of madrid is None is not correct"

assert sha1(str(type(madrid.shape)).encode("utf-8")+b"d3ea1").hexdigest() == "a136043fdaa1f1d5e3b5729b7e45517a5ece5ff8", "type of madrid.shape is not tuple. madrid.shape should be a tuple"
assert sha1(str(len(madrid.shape)).encode("utf-8")+b"d3ea1").hexdigest() == "f10064a258bb2dfb1bee6cfe9772b7554c337167", "length of madrid.shape is not correct"
assert sha1(str(sorted(map(str, madrid.shape))).encode("utf-8")+b"d3ea1").hexdigest() == "ffcc4232c674dfa91411a8e677876da4f52eded2", "values of madrid.shape are not correct"
assert sha1(str(madrid.shape).encode("utf-8")+b"d3ea1").hexdigest() == "3423c7473242e9a807f7b40922d2f945e6e5a0d1", "order of elements of madrid.shape is not correct"

assert sha1(str(type(madrid.columns.values)).encode("utf-8")+b"d3ea2").hexdigest() == "58af7d2a7eaa5cb79ca1f9a8e99433e5d86ae4c5", "type of madrid.columns.values is not correct"
assert sha1(str(madrid.columns.values).encode("utf-8")+b"d3ea2").hexdigest() == "f99598e64098df3b5845ef6c5579060a217eb31e", "value of madrid.columns.values is not correct"

print('Success!')

Given that we are going to be plotting months, which are dates, let's tell Python how they should be ordered. We can do this by changing the `month` column from a string column to an ordered categorical column and then sorting the data frame by that column. 

We will also drop the `date` column as we are interested only in monthly averages. 
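
To see why the conversion to an ordered categorical matters, here is a minimal sketch on a made-up data frame (the `toy` data frame and its values are hypothetical and are not part of the Madrid data). It compares the order you get when sorting plain strings with the order you get after declaring the categories explicitly.

```python
import pandas as pd

# Hypothetical toy data: three month names in arbitrary order
toy = pd.DataFrame({"month": ["March", "January", "February"], "value": [3, 1, 2]})

# Sorting plain strings follows alphabetical order
alphabetical = toy.sort_values("month")["month"].tolist()

# After converting to an ordered categorical, sorting follows the
# order given in `categories` instead
toy["month"] = pd.Categorical(
    toy["month"],
    categories=["January", "February", "March"],
    ordered=True,
)
calendar = toy.sort_values("month")["month"].tolist()

print(alphabetical)
print(calendar)
```

The same idea applies to plots: an ordered categorical gives downstream tools an explicit order to follow rather than leaving them to fall back on alphabetical order.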

In [None]:
# run this cell to order the column month by month (date) and not alphabetically

madrid["month"] = pd.Categorical(
    madrid["month"],
    categories=[
        "January",
        "February",
        "March",
        "April",
        "May",
        "June",
        "July",
        "August",
        "September",
        "October",
        "November",
        "December",
    ],
    ordered=True,
)
madrid = madrid.sort_values("month")
madrid = madrid.drop(columns=["date"])

**Question 4.2**
<br> {points: 3}

Calculate the average monthly value for each pollutant for each year and store that as a data frame. Your data frame should have the following 4 columns:

1. `year`
2. `month`
3. `pollutant`
4. `value`

Name your data frame `madrid_avg`.
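
As a reminder of how reshaping and grouped summaries fit together, the following is a rough sketch on made-up data. The data frame `toy`, the pollutant columns `A` and `B`, and all values are assumptions chosen for illustration; this is one possible pattern, not the required solution.

```python
import pandas as pd

# Hypothetical wide data: one column per pollutant, two readings per month
toy = pd.DataFrame({
    "year": [2001, 2001, 2001, 2001],
    "month": ["January", "January", "February", "February"],
    "A": [1.0, 3.0, 5.0, 7.0],
    "B": [10.0, 20.0, 30.0, 50.0],
})

# Lengthen the data: one row per (year, month, pollutant) reading
toy_long = toy.melt(
    id_vars=["year", "month"],
    var_name="pollutant",
    value_name="value",
)

# Average the readings within each year, month and pollutant,
# then turn the group keys back into ordinary columns
toy_avg = (
    toy_long
    .groupby(["year", "month", "pollutant"])["value"]
    .mean()
    .reset_index()
)

print(toy_avg)
```

Melting first means a single `groupby` handles every pollutant at once, instead of averaging each pollutant column separately.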

In [None]:
# your code here
raise NotImplementedError
madrid_avg

In [None]:
from hashlib import sha1
assert sha1(str(type(madrid_avg is None)).encode("utf-8")+b"e87cb").hexdigest() == "f16b9112862a4b706234fd8ebf4f836a853848c7", "type of madrid_avg is None is not bool. madrid_avg is None should be a bool"
assert sha1(str(madrid_avg is None).encode("utf-8")+b"e87cb").hexdigest() == "5f197650e70f5ab37282d6e22b116499ab18349e", "boolean value of madrid_avg is None is not correct"

assert sha1(str(type(madrid_avg.shape)).encode("utf-8")+b"e87cc").hexdigest() == "8c28d7535ac20f2d0ae75289a423406f363b5da1", "type of madrid_avg.shape is not tuple. madrid_avg.shape should be a tuple"
assert sha1(str(len(madrid_avg.shape)).encode("utf-8")+b"e87cc").hexdigest() == "41e3bad79f8733a0e3496200babfd61197a152bc", "length of madrid_avg.shape is not correct"
assert sha1(str(sorted(map(str, madrid_avg.shape))).encode("utf-8")+b"e87cc").hexdigest() == "531f8a44662c7f575fb65a81a45bd10ab643875a", "values of madrid_avg.shape are not correct"
assert sha1(str(madrid_avg.shape).encode("utf-8")+b"e87cc").hexdigest() == "44221fe74b46cae47e32c1946b8573279e255865", "order of elements of madrid_avg.shape is not correct"

assert sha1(str(type(round(sum(madrid_avg.value.dropna()), 0))).encode("utf-8")+b"e87cd").hexdigest() == "cddae962c8396bad1e1f1c093abb1726a277bd56", "type of round(sum(madrid_avg.value.dropna()), 0) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(round(sum(madrid_avg.value.dropna()), 0), 2)).encode("utf-8")+b"e87cd").hexdigest() == "4b3e9db6b8f14e51f4beb9ecfaff0dfa3899cd04", "value of round(sum(madrid_avg.value.dropna()), 0) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(str(madrid_avg["year"].dtype))).encode("utf-8")+b"e87ce").hexdigest() == "3b5adea27acdd60edc1f5c341cfdc3c84ed6d3ea", "type of str(madrid_avg[\"year\"].dtype) is not str. str(madrid_avg[\"year\"].dtype) should be an str"
assert sha1(str(len(str(madrid_avg["year"].dtype))).encode("utf-8")+b"e87ce").hexdigest() == "f589ebd3b9f656cdfcc3f59071820ea938dd57ec", "length of str(madrid_avg[\"year\"].dtype) is not correct"
assert sha1(str(str(madrid_avg["year"].dtype).lower()).encode("utf-8")+b"e87ce").hexdigest() == "5e5c7d9e1763755f484ac7eacecba8373157a196", "value of str(madrid_avg[\"year\"].dtype) is not correct"
assert sha1(str(str(madrid_avg["year"].dtype)).encode("utf-8")+b"e87ce").hexdigest() == "5e5c7d9e1763755f484ac7eacecba8373157a196", "correct string value of str(madrid_avg[\"year\"].dtype) but incorrect case of letters"

assert sha1(str(type(str(madrid_avg["month"].dtype))).encode("utf-8")+b"e87cf").hexdigest() == "d85392d189bcab1d33b42c8c06f434f261ae6a50", "type of str(madrid_avg[\"month\"].dtype) is not str. str(madrid_avg[\"month\"].dtype) should be an str"
assert sha1(str(len(str(madrid_avg["month"].dtype))).encode("utf-8")+b"e87cf").hexdigest() == "bcc1ecd32ef230db4bc237e8ba2fd865cf189ec4", "length of str(madrid_avg[\"month\"].dtype) is not correct"
assert sha1(str(str(madrid_avg["month"].dtype).lower()).encode("utf-8")+b"e87cf").hexdigest() == "ee902055ecca9a1112969735749581fca4779a4d", "value of str(madrid_avg[\"month\"].dtype) is not correct"
assert sha1(str(str(madrid_avg["month"].dtype)).encode("utf-8")+b"e87cf").hexdigest() == "ee902055ecca9a1112969735749581fca4779a4d", "correct string value of str(madrid_avg[\"month\"].dtype) but incorrect case of letters"

assert sha1(str(type(str(madrid_avg["pollutant"].dtype))).encode("utf-8")+b"e87d0").hexdigest() == "7571336bdd196115a5070a8e1ab29424c113292c", "type of str(madrid_avg[\"pollutant\"].dtype) is not str. str(madrid_avg[\"pollutant\"].dtype) should be an str"
assert sha1(str(len(str(madrid_avg["pollutant"].dtype))).encode("utf-8")+b"e87d0").hexdigest() == "9b75dda42cca52cf68efc9ef1392ee98b9fa129e", "length of str(madrid_avg[\"pollutant\"].dtype) is not correct"
assert sha1(str(str(madrid_avg["pollutant"].dtype).lower()).encode("utf-8")+b"e87d0").hexdigest() == "bb65ebf0671ca8c821b2e6b987f7d2be89e0da6d", "value of str(madrid_avg[\"pollutant\"].dtype) is not correct"
assert sha1(str(str(madrid_avg["pollutant"].dtype)).encode("utf-8")+b"e87d0").hexdigest() == "bb65ebf0671ca8c821b2e6b987f7d2be89e0da6d", "correct string value of str(madrid_avg[\"pollutant\"].dtype) but incorrect case of letters"

assert sha1(str(type(str(madrid_avg["value"].dtype))).encode("utf-8")+b"e87d1").hexdigest() == "2f5258eb644b179b5451a4a322dc2cae75803896", "type of str(madrid_avg[\"value\"].dtype) is not str. str(madrid_avg[\"value\"].dtype) should be an str"
assert sha1(str(len(str(madrid_avg["value"].dtype))).encode("utf-8")+b"e87d1").hexdigest() == "fa1f60e4298cb420c9cfb7c19729c167be6c303a", "length of str(madrid_avg[\"value\"].dtype) is not correct"
assert sha1(str(str(madrid_avg["value"].dtype).lower()).encode("utf-8")+b"e87d1").hexdigest() == "85947f00d8c0f15abee96531b8eaa64d071e0d85", "value of str(madrid_avg[\"value\"].dtype) is not correct"
assert sha1(str(str(madrid_avg["value"].dtype)).encode("utf-8")+b"e87d1").hexdigest() == "85947f00d8c0f15abee96531b8eaa64d071e0d85", "correct string value of str(madrid_avg[\"value\"].dtype) but incorrect case of letters"

print('Success!')

**Question 4.3**
<br> {points: 3}

Create a scatter plot for the average monthly value for each month. Plot these values for each pollutant and each year so that a trend over time for each pollutant can be observed. To do this all in one plot, you are going to want to use a `facet` (makes subplots within one plot when data are "related").

In [None]:
# pollutant_labels = {
#     "BEN": ["Benzene", "(μg/m³)"],
#     "CO": ["Carbon monoxide", "(mg/m³)"],
#     "EBE": ["Ethylbenzene", "(μg/m³)"],
#     "MXY": ["M-xylene", "(μg/m³)"],
#     "NMHC": ["Non-methane hydrocarbons", "(mg/m³)"],
#     "NO_2": ["Nitrogen dioxide", "(μg/m³)"],
#     "NOx": ["Nitrogen oxides", "(μg/m³)"],
#     "O_3": ["Ozone", "(μg/m³)"],
#     "OXY": ["O-xylene", "(μg/m³)"],
#     "PM10": ["Particles", "smaller than 10 μm"],
#     "PXY": ["P-xylene", "(μg/m³)"],
#     "SO_2": ["Sulphur dioxide", "(μg/m³)"],
#     "TCH": ["Total hydrocarbons", "(mg/m³)"],
#     "TOL": ["Toluene", "(μg/m³)"],
# }
# madrid_avg = madrid_avg.assign(
#     label=madrid_avg['pollutant'].map(pollutant_labels)   
# )

# ___ = alt.Chart(___).mark_point().encode(
#     x=alt.X(___).title(___),
#     y=alt.Y(___).title(___)
# ).properties(
#     width=200,
#     height=100
# ).facet(
#     column="year",
#     row="label"
# ).resolve_scale(
#     y="independent"
# )

# your code here
raise NotImplementedError
madrid_avg_plot

**Question 4.4**
<br> {points: 3}

By looking at the plots above, which monthly average pollutant levels appear to have decreased over time? Which appear to have increased?

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 4.5**
<br> {points: 3}

Now we will look at which pollutant decreased the most between 2001 - 2006 when we look at the average yearly values for each pollutant. Your final result should be a data frame that has at least these two columns: `pollutant` and `yearly_avg_diff` and one row (the most decreased pollutant when looking at yearly average between 2001 - 2006). **Make sure to use the `madrid_avg` data frame in your solution.**

*This question is a bit more challenging and there are several different ways to solve it. You will need to make the correct grouping and aggregation before pivoting the dataframe and make sure you use `reset_index` where appropriate. You are free to find other ways to solve the question outside what we teach in class, but this is not necessary.*
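
To illustrate the pivot step mentioned in the hint, here is a small sketch on made-up yearly averages. The data frame `toy_yearly`, the column name `toy_diff`, and the numbers are all hypothetical; the sketch assumes the grouping and aggregation have already produced one row per pollutant and year.

```python
import pandas as pd

# Hypothetical yearly averages for two made-up pollutants
toy_yearly = pd.DataFrame({
    "pollutant": ["A", "A", "B", "B"],
    "year": [2001, 2006, 2001, 2006],
    "yearly_avg": [10.0, 4.0, 8.0, 7.0],
})

# Widen the data so each year becomes its own column;
# reset_index turns `pollutant` back into an ordinary column
toy_wide = toy_yearly.pivot(
    index="pollutant",
    columns="year",
    values="yearly_avg",
).reset_index()

# The year columns are labelled with the integer years themselves
toy_wide = toy_wide.assign(toy_diff=toy_wide[2006] - toy_wide[2001])

# The most negative difference is the largest decrease
most_decreased = toy_wide.sort_values("toy_diff").head(1)

print(most_decreased)
```

Pivoting puts the two years side by side, so the difference becomes a simple column subtraction.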

In [None]:
# your code here
raise NotImplementedError
madrid_max_diff_avg

**Question 4.6** 
<br> {points: 3}

Did using the average to find the most decreased pollutant between 2001 and 2006 give you the same answer as using the maximum in the worksheet? Is your answer to the previous question surprising? Explain.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Optional Question** 
<br>(for fun and does not count for grades):

Consider doing the same analysis as you did for Question 4.5, except this time calculate the difference as a percent or fold difference (as opposed to absolute difference as we did in Question 4.5). The scales for the pollutants are very different, and so we might want to take this into consideration when trying to answer the question "Which pollutant decreased the most"?
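
The sketch below shows, on made-up numbers, how an absolute difference, a percent difference, and a fold difference can rank pollutants differently. The data frame `toy` and its columns `start` and `end` are hypothetical stand-ins for the yearly averages at the beginning and end of the period.

```python
import pandas as pd

# Hypothetical pollutants measured on very different scales
toy = pd.DataFrame({
    "pollutant": ["A", "B"],
    "start": [100.0, 2.0],
    "end": [90.0, 1.0],
})

toy = toy.assign(
    # Absolute change, in the pollutant's own units
    abs_diff=toy["end"] - toy["start"],
    # Change relative to the starting value, as a percentage
    pct_diff=(toy["end"] - toy["start"]) / toy["start"] * 100,
    # Ratio of the end value to the start value
    fold=toy["end"] / toy["start"],
)

# Which pollutant "decreased the most" depends on the measure used
largest_abs_drop = toy.loc[toy["abs_diff"].idxmin(), "pollutant"]
largest_pct_drop = toy.loc[toy["pct_diff"].idxmin(), "pollutant"]

print(toy)
print(largest_abs_drop, largest_pct_drop)
```

In this toy case the pollutant with the larger absolute drop is not the one with the larger relative drop, which is why the choice of measure matters when scales differ.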

In [None]:
# Your optional answer goes here