# Worksheet 2: Introduction to Reading Data

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

* define the following:
    - absolute file path
    - relative file path
    - URL
* read data into Python using a relative path and a URL
* compare and contrast the following functions:
    - `read_csv` 
    - `read_excel`
* match the following `pandas` `read_*` function arguments to their descriptions:
    - `filepath_or_buffer` 
    - `sep`
    - `names`
    - `skiprows`
* Connect to a database using the `ibis` library's `connect` function.
* List the tables in a database using the `ibis` library's `list_tables` function
* Create a reference to a database table using the `ibis` library's `table` function
* Execute queries to bring data from a database into Python using the `ibis` library's `execute` function
* Use `to_csv` to save a data frame to a `.csv` file


This worksheet covers parts of [Chapter 2](https://python.datasciencebook.ca/reading.html) of the online textbook. You should read this chapter before attempting the worksheet.

In [None]:
### Run this cell before continuing.
import os

import altair as alt
import pandas as pd
import numpy as np

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

In [None]:
### Run this cell before continuing.
# Remove any leftover output file from a previous run; ignore it if it is not there.
try:
    os.remove("data/delay_data.csv")
except OSError:
    pass

## 1. Comparing Absolute Paths, Relative Paths, and URLs

**Question 1.1** Multiple Choice:
<br> {points: 1}

If you needed to read a file using an absolute path, what would be the first symbol in your argument `(___)` when using the `pd.read_csv` function?

A. `pd.read_csv(">___")`

B. `pd.read_csv(";___")`

C. `pd.read_csv("___")`

D. `pd.read_csv("/___")`

*Assign your answer to an object called `answer1_1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_1)).encode("utf-8")+b"b938c").hexdigest() == "25b9938976b17f98485387e6d52373605bf997ca", "type of answer1_1 is not str. answer1_1 should be an str"
assert sha1(str(len(answer1_1)).encode("utf-8")+b"b938c").hexdigest() == "34c7023db7177033a5e6f44f9b3f404725583e5f", "length of answer1_1 is not correct"
assert sha1(str(answer1_1.lower()).encode("utf-8")+b"b938c").hexdigest() == "050fde729c45eb602c6b2c30226713c6270fbad3", "value of answer1_1 is not correct"
assert sha1(str(answer1_1).encode("utf-8")+b"b938c").hexdigest() == "27176bbaeb0397d3d2c555ad4eba70d2010228b9", "correct string value of answer1_1 but incorrect case of letters"

print('Success!')

**Question 1.2** True or False: 
<br> {points: 1}

The file argument in the `pd.read_csv` function that uses an absolute path can *never* look like that of a relative path.

*Assign your answer to an object called `answer1_2`. Make sure your answer is a boolean (e.g. `True` or `False`).* 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_2)).encode("utf-8")+b"e0200").hexdigest() == "6404651f583570f7f80649ef96302d66bb57292f", "type of answer1_2 is not bool. answer1_2 should be a bool"
assert sha1(str(answer1_2).encode("utf-8")+b"e0200").hexdigest() == "0eaab581221d37581ffe32ee2b1781a634a00d51", "boolean value of answer1_2 is not correct"

print('Success!')

**Question 1.3** 
Match the following paths with the correct path type that they represent:
<br> {points: 1}

*Example Path*

A. `/Users/my_user/Desktop/UBC/BIOL363/SciaticNerveLab/sn_trial_1.xlsx`

B. `https://www.ubc.ca`

C. `file_1.csv`

D. `/Users/name/Documents/Course_A/homework/my_first_homework.docx`

E. `homework/my_second_homework.docx`

F. `https://www.random_website.com`


*Path Type*

1. absolute
2. relative
3. URL

For every argument, create an object using the letter associated with the example path and assign it the corresponding number from the list of path types. For example: `B = 1`. 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(A)).encode("utf-8")+b"b5ce4").hexdigest() == "36b2b9d2808d397a4670060f26363e5b404f389b", "type of A is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(A).encode("utf-8")+b"b5ce4").hexdigest() == "829943fb1d8b79216d331c4b3d7d0e6d16363227", "value of A is not correct"

assert sha1(str(type(B)).encode("utf-8")+b"b5ce5").hexdigest() == "dd8cad765f656925226fed0398d6525b3eb8058d", "type of B is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(B).encode("utf-8")+b"b5ce5").hexdigest() == "5ada940b690f9b07d7d5e5967e596016f4d7499b", "value of B is not correct"

assert sha1(str(type(C)).encode("utf-8")+b"b5ce6").hexdigest() == "a0dbff212fc29490665dce3d2835539f2b8b4bd2", "type of C is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(C).encode("utf-8")+b"b5ce6").hexdigest() == "b4b510116f9fbdff4407be21efe16990943fb0c2", "value of C is not correct"

assert sha1(str(type(D)).encode("utf-8")+b"b5ce7").hexdigest() == "f2b101c89c6b7bf334d3ed3c1e84d907198ca744", "type of D is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(D).encode("utf-8")+b"b5ce7").hexdigest() == "6ada89999d2f019646edff3b6b3114b052026328", "value of D is not correct"

assert sha1(str(type(E)).encode("utf-8")+b"b5ce8").hexdigest() == "dbf16955154ea5a690fe8f396fbe5419fa14c800", "type of E is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(E).encode("utf-8")+b"b5ce8").hexdigest() == "2790efed245518fa8a90bc71958e4f834ada72c7", "value of E is not correct"

assert sha1(str(type(F)).encode("utf-8")+b"b5ce9").hexdigest() == "4da001f4cd91e78cc6a3f61d94412135c3a948a6", "type of F is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(F).encode("utf-8")+b"b5ce9").hexdigest() == "8383de78057238262f48930aaaede30bbb987d17", "value of F is not correct"

print('Success!')

**Question 1.4** Multiple Choice:
<br> {points: 1}

If the absolute path to a data file looks like this: `/Users/my_user/Desktop/UBC/BIOL363/SciaticNerveLab/sn_trial_1.xlsx`

What would the relative path look like if the working directory (i.e., the folder containing the Jupyter notebook you are running your Python code from) is the `UBC` folder?

A. `sn_trial_1.xlsx`

B. `/SciaticNerveLab/sn_trial_1.xlsx`

C. `BIOL363/SciaticNerveLab/sn_trial_1.xlsx`

D. `UBC/BIOL363/SciaticNerveLab/sn_trial_1.xlsx`

E. `/BIOL363/SciaticNerveLab/sn_trial_1.xlsx`

*Assign your answer to an object called `answer1_4`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).* 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_4)).encode("utf-8")+b"bc1fc").hexdigest() == "a2e85fef9134edc9ef0df21bbf54d2fdd986b940", "type of answer1_4 is not str. answer1_4 should be an str"
assert sha1(str(len(answer1_4)).encode("utf-8")+b"bc1fc").hexdigest() == "451775adfac58e0bed3fbe4fd52c5ebc937ffbd1", "length of answer1_4 is not correct"
assert sha1(str(answer1_4.lower()).encode("utf-8")+b"bc1fc").hexdigest() == "96c119869cbef7f689cd2e1b0d5cad1896d91e0f", "value of answer1_4 is not correct"
assert sha1(str(answer1_4).encode("utf-8")+b"bc1fc").hexdigest() == "349302ca002e89c1306a6e84de033587689823e2", "correct string value of answer1_4 but incorrect case of letters"

print('Success!')

**Question 1.5**
<br> {points: 1}

Match the following paths with the most likely kind of data format they contain. 

*Paths:*

1. `https://www.ubc.ca/datasets/data.db`
2. `/home/user/downloads/data.xlsx`
3. `data.tsv`
4. `examples/data/data.csv`
5. `https://en.wikipedia.org/wiki/Normal_distribution`

*Dataset Types:*

A. Excel Spreadsheet

B. Database

C. HTML file

D. Comma-separated values file

E. Tab-separated values file

For every dataset type, create an object using the letter associated with the example and assign it the corresponding number from the list of paths. For example: `F = 5`


In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(A)).encode("utf-8")+b"75de").hexdigest() == "1be19333ebaa4cacad500382e61f5357e265d97b", "type of A is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(A).encode("utf-8")+b"75de").hexdigest() == "df453100033b7ee9313789eb4ec7decaea4adf01", "value of A is not correct"

assert sha1(str(type(B)).encode("utf-8")+b"75df").hexdigest() == "582bef44c12268fe18a87ee895a5c91a33704991", "type of B is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(B).encode("utf-8")+b"75df").hexdigest() == "d6820e7ebfa4bb43ea53dcc836c28ce44cd8dbb3", "value of B is not correct"

assert sha1(str(type(C)).encode("utf-8")+b"75e0").hexdigest() == "29b74288e61adf0bf2d98e5f9bc5bc0a12539250", "type of C is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(C).encode("utf-8")+b"75e0").hexdigest() == "74a89c407fb08873cdf00a3f384ce231df5e9d8e", "value of C is not correct"

assert sha1(str(type(D)).encode("utf-8")+b"75e1").hexdigest() == "14c4af127c0518c1525efb4f61b20f8d4874e91b", "type of D is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(D).encode("utf-8")+b"75e1").hexdigest() == "7ef89b09e39ea5115a4ee799f2599acd1d09657f", "value of D is not correct"

assert sha1(str(type(E)).encode("utf-8")+b"75e2").hexdigest() == "02d2219aa609ab148c23b28b4a6dbbf2759e4314", "type of E is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(E).encode("utf-8")+b"75e2").hexdigest() == "5a12ff12fc9922a2515cbfcf2d656923dffbd421", "value of E is not correct"

print('Success!')

## 2. Argument Modifications to Read Data
Reading files is one of the first steps in wrangling data, and consequently `pd.read_csv` is a crucial function in Python. However, despite how effortlessly it has worked so far, it has its limitations.

Not all data sets come as perfectly organized as the ones you worked with last week. Time and effort were put into ensuring that the files were arranged with headers, that columns were separated by commas, and that the beginning of each file excluded metadata.

Now that you understand how to read files located outside (or inside) of your working directory, you can begin to learn the tips and tricks necessary to overcome the limitations of `read_csv`.
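
As a rough illustration of what such a less tidy file can look like, here is a minimal sketch. The text, column names, and values below are made up for this example (they are not from the worksheet's data), and `io.StringIO` simply stands in for a file path so the cell runs without any file on disk.

```python
import io

import pandas as pd

# Made-up file contents: two metadata lines, no header row, ";" between columns.
raw_text = """Source: made-up example
Collected: unknown
Canada;7.3;0.61
Norway;7.5;0.64
"""

toy_df = pd.read_csv(
    io.StringIO(raw_text),  # stands in for a file name, path, or URL
    sep=";",                # character separating the columns
    skiprows=2,             # skip the two metadata lines at the top
    names=["country", "score", "freedom"],  # hypothetical column names
)
print(toy_df)
```

Each argument handles one way in which the toy file departs from a plain comma-separated file with a header row.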

In [None]:
### Run this cell to learn more about the arguments used in pd.read_csv
### Reading over the help file will assist with the next question. 

?pd.read_csv

**Question 2.1** 
<br> {points: 1}

Match the following descriptions with the corresponding arguments used in `pd.read_csv`:

*Descriptions*

G. Character that separates columns in your file. 

H. Specifies a list of column names to use when reading in a file.

I. This is the file name, path to a file, or URL. 

J. Specifies the number of lines which must be ignored because they contain metadata. 


*Arguments*

1. `filepath_or_buffer`
2. `sep`
3. `names`
4. `skiprows`

For every description, create an object using the letter associated with the description and assign it the corresponding number from the list of arguments. For example: `G = 1`

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(G)).encode("utf-8")+b"1393b").hexdigest() == "b4d8c13d6560540abcb2139e6770c4f068bfa815", "type of G is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(G).encode("utf-8")+b"1393b").hexdigest() == "03f0cbba08a2791b75e2384272cfd18d3bb7e074", "value of G is not correct"

assert sha1(str(type(H)).encode("utf-8")+b"1393c").hexdigest() == "0bcd4edb51361b345692c849db43ddda26a84853", "type of H is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(H).encode("utf-8")+b"1393c").hexdigest() == "9e89eeb5fb311ba2eaaf46c191fdea119dbae245", "value of H is not correct"

assert sha1(str(type(I)).encode("utf-8")+b"1393d").hexdigest() == "f013cd8ef9539bd4fbdf7aa43774c56927b51e2f", "type of I is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(I).encode("utf-8")+b"1393d").hexdigest() == "e2029852dbebcc29dc32eba4d57991ccb0ffd823", "value of I is not correct"

assert sha1(str(type(J)).encode("utf-8")+b"1393e").hexdigest() == "9c06d74a4cf6b2d721b2ccc8bbdcfff334454aef", "type of J is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(J).encode("utf-8")+b"1393e").hexdigest() == "218d272f43a3137973b91f26fb13135dceecb99c", "value of J is not correct"

print('Success!')

**Question 2.2** True or False:
<br> {points: 1}

`pd.read_csv` can be used for reading files that have columns separated by `;`. 

*Assign your answer to an object called `answer2_2`. Make sure your answer is a boolean (e.g. `True` or `False`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_2)).encode("utf-8")+b"d8e81").hexdigest() == "47f1643afdc70a2dd7340a36cf30aba89c53084d", "type of answer2_2 is not bool. answer2_2 should be a bool"
assert sha1(str(answer2_2).encode("utf-8")+b"d8e81").hexdigest() == "93798b4ea746020641592b8313b5122130348aca", "boolean value of answer2_2 is not correct"

print('Success!')

**Question 2.3** True or False: 
<br> {points: 1}

`pd.read_csv` can be used for files that have columns separated by one or more of the following characters: letters, tabs, semicolons, or commas.

*Assign your answer to an object called `answer2_3`. Make sure your answer is a boolean (e.g. `True` or `False`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_3)).encode("utf-8")+b"cb5e3").hexdigest() == "022adddca73988daa0fed2f844c6bc6d32c11f53", "type of answer2_3 is not bool. answer2_3 should be a bool"
assert sha1(str(answer2_3).encode("utf-8")+b"cb5e3").hexdigest() == "74d2a814a3f87baffa7bce2d9e338a2278684367", "boolean value of answer2_3 is not correct"

print('Success!')

## 3. Happiness Report (2017)
This data was taken from [Kaggle](https://www.kaggle.com/unsdsn/world-happiness) and ranks countries on happiness based on rationalized factors like economic growth, social support, etc. The data was released by the United Nations at an event celebrating International Day of Happiness.  According to the website, the file contains the following information:

* Country = Name of the country.
* Region = Region the country belongs to.
* Happiness Rank = Rank of the country based on the Happiness Score.
* Happiness Score = A metric measured by asking the sampled people the question: "How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest?"
* Standard Error = The standard error of the happiness score.
* Economy (GDP per Capita) = The extent to which GDP contributes to the calculation of the Happiness Score.
* Family = The extent to which Family contributes to the calculation of the Happiness Score.
* Health (Life Expectancy) = The extent to which Life expectancy contributed to the calculation of the Happiness Score.
* Freedom = The extent to which Freedom contributed to the calculation of the Happiness Score.
* Trust (Government Corruption) = The extent to which Perception of Corruption contributes to Happiness Score.
* Generosity = The extent to which Generosity contributed to the calculation of the Happiness Score.
* Dystopia Residual = The extent to which Dystopia Residual contributed to the calculation of the Happiness Score.

To clean up the file and make it easier to read, we only kept the country name, happiness score, economy (GDP per capita), life expectancy, and freedom. The happiness scores and rankings use data from the Gallup World Poll, which surveys citizens in countries from around the world.

Kaggle stores this information but it is compiled by the *Sustainable Development Solutions Network*. They survey these factors nearly every year (since 2012) and allow global comparisons to optimize political decision making. These landmark surveys are highly recognized and allow countries to learn and grow from one another. One day, they will provide a historical insight on the nature of our time.  

**Question 3.1** Fill in the Blank: 
<br> {points: 1}

Trust is the extent to which \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ contributes to Happiness Score. 

A. Corruption 

B. Government Intervention 

C. Perception of Corruption  

D. Tax Money Designation 

*Assign your answer to an object called `answer3_1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).* 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer3_1)).encode("utf-8")+b"eba1f").hexdigest() == "d3dbba74c69fd3c5d8db8fc71b40ec32ca23d9c9", "type of answer3_1 is not str. answer3_1 should be an str"
assert sha1(str(len(answer3_1)).encode("utf-8")+b"eba1f").hexdigest() == "826462804be51475650da94e3994b813484c83bc", "length of answer3_1 is not correct"
assert sha1(str(answer3_1.lower()).encode("utf-8")+b"eba1f").hexdigest() == "4018c33709ba3a35594504b4f7308d577ea75d21", "value of answer3_1 is not correct"
assert sha1(str(answer3_1).encode("utf-8")+b"eba1f").hexdigest() == "05ca36de85cf852b65c22507db83720212267da2", "correct string value of answer3_1 but incorrect case of letters"

print('Success!')

**Question 3.2** Multiple Choice: 
<br> {points: 1}

What is the happiness report?

A. Study conducted by the governments of multiple countries. 

B. Independent survey of citizens from multiple countries.

C. Study conducted by the UN. 

D. Survey given to international students by UBC's psychology department. 

*Assign your answer to an object called `answer3_2`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).* 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer3_2)).encode("utf-8")+b"a7850").hexdigest() == "157121679b7151388ba166e7bae55d7da772d966", "type of answer3_2 is not str. answer3_2 should be an str"
assert sha1(str(len(answer3_2)).encode("utf-8")+b"a7850").hexdigest() == "835e43e40ddcf01d17f9c882af2019e4668e7e74", "length of answer3_2 is not correct"
assert sha1(str(answer3_2.lower()).encode("utf-8")+b"a7850").hexdigest() == "428a28c5a897d9eab291eb3de8582c61b05811e7", "value of answer3_2 is not correct"
assert sha1(str(answer3_2).encode("utf-8")+b"a7850").hexdigest() == "6288a577e7e8768a57de814c4716052cb57469c9", "correct string value of answer3_2 but incorrect case of letters"

print('Success!')

**Question 3.3** Fill in the Blanks (of the Table):
<br> {points: 1}

It is often a good idea to try to "inspect" your data to see what it looks like before trying to load it into Python. This will help you figure out the right function to call and what arguments to use. When your data are stored as plain text, you can do this easily with Jupyter (or any text editor). 

In your working directory (the `worksheet_02` directory), open each file named `happiness_report...` in the `data` folder with Jupyter's plain text editor (**Right click the file -> Open With -> Editor**). This lets you see how the data in each file is organized. Based on your findings, fill in the missing items A-F in the table below. This table will be very useful to refer back to in the coming weeks.

*You'll notice that trying to open one of the files gives you an error (`File Load Error ... is not UTF-8 encoded`). This means that this data is not stored as human-readable plain text. For this special file, just fill in the* `read_*` *function entry, the other columns will be left blank.*
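
If you prefer to inspect a plain text file from code rather than from the editor, one possible approach is to print its first few lines. The sketch below writes a small made-up file to a temporary folder (the file name `toy_example.csv` and its contents are assumptions for illustration only) and then reads back just the top of it.

```python
import os
import tempfile
from itertools import islice

# Write a small made-up file so this sketch does not depend on the worksheet's data.
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy_example.csv")
with open(toy_path, "w", encoding="utf-8") as f:
    f.write("country,score\nCanada,7.3\nNorway,7.5\n")

# Read only the first two lines as plain text, without parsing them.
with open(toy_path, encoding="utf-8") as f:
    first_lines = [line.rstrip("\n") for line in islice(f, 2)]

for line in first_lines:
    print(line)
```

Looking at the raw lines this way shows the separator, whether there is a header row, and whether any metadata lines come first. A file that is not stored as plain text will typically raise a decoding error or show unreadable characters instead.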

|file name                       | sep      | header | metadata | skiprows               | read_*   |
|--------------------------------|------------|--------|----------|--------------------|----------|
|`_.csv`                         |`";"`, `","`, `"\"`, or `\t`|`"yes"`or `"no"`|`"yes"`or `"no"`|`NA` or # of lines|`pd.read_*`|
|`happiness_report.csv`          |,           |**A**     |no        |`NA`                  |`pd.read_csv`  |
|`happiness_report_semicolon.csv`|;           |yes     |no        |`NA`                  |**B** |
|`happiness_report.tsv`          |**C**         |yes     |no        |`NA`                  |`pd.read_csv`  |
|`happiness_report_metadata.csv` |,           |yes     |**D**       |2                   |`pd.read_csv`  |
|`happiness_report_no_header.csv`|,           |**E**      |no      |`NA`                  |`pd.read_csv`  |
|`happiness_report.xlsx`         |            |        |          |                    |**F**|

For the missing items (labelled A to F) in the table above, create an object using the letter and assign it the corresponding missing value.

For example: `A = "yes"`. The possible options for each column are given in the first row of the table. 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(A)).encode("utf-8")+b"3e09f").hexdigest() == "a3d4ab8a50aa681f34b7a9d407778f625a2d5f5e", "type of A is not str. A should be an str"
assert sha1(str(len(A)).encode("utf-8")+b"3e09f").hexdigest() == "3d90adfc829e8faf6e81c58d806f16bb87b89ee1", "length of A is not correct"
assert sha1(str(A.lower()).encode("utf-8")+b"3e09f").hexdigest() == "255959bb83951b1a54ec032d4d50fb1442c90086", "value of A is not correct"
assert sha1(str(A).encode("utf-8")+b"3e09f").hexdigest() == "255959bb83951b1a54ec032d4d50fb1442c90086", "correct string value of A but incorrect case of letters"

assert sha1(str(type(B)).encode("utf-8")+b"3e0a0").hexdigest() == "ae3f7041b926de13cbc9744517c235a02c5ae0e2", "type of B is not str. B should be an str"
assert sha1(str(len(B)).encode("utf-8")+b"3e0a0").hexdigest() == "838f1cad440dfef78d96951b6ae0529e7efd6506", "length of B is not correct"
assert sha1(str(B.lower()).encode("utf-8")+b"3e0a0").hexdigest() == "38d9a98f5f396858d8a7656413b10ac0bc3b8bcc", "value of B is not correct"
assert sha1(str(B).encode("utf-8")+b"3e0a0").hexdigest() == "38d9a98f5f396858d8a7656413b10ac0bc3b8bcc", "correct string value of B but incorrect case of letters"

assert sha1(str(type(C)).encode("utf-8")+b"3e0a1").hexdigest() == "8bacbed45c1e982ebd8f4f253ae0eea612650ad5", "type of C is not str. C should be an str"
assert sha1(str(len(C)).encode("utf-8")+b"3e0a1").hexdigest() == "59d0301c87e2080b06debbfb9b5cabef57932fc8", "length of C is not correct"
assert sha1(str(C.lower()).encode("utf-8")+b"3e0a1").hexdigest() == "628f11da608f1ff354e39fd61c94df7151e70509", "value of C is not correct"
assert sha1(str(C).encode("utf-8")+b"3e0a1").hexdigest() == "628f11da608f1ff354e39fd61c94df7151e70509", "correct string value of C but incorrect case of letters"

assert sha1(str(type(D)).encode("utf-8")+b"3e0a2").hexdigest() == "0c8bbc755005bc6df2657ebef2de45f1b6b964a9", "type of D is not str. D should be an str"
assert sha1(str(len(D)).encode("utf-8")+b"3e0a2").hexdigest() == "cd8d392b0fd8e2519f45aac202b7e811b91f53f1", "length of D is not correct"
assert sha1(str(D.lower()).encode("utf-8")+b"3e0a2").hexdigest() == "20b157bcc96217e3e2c49c4d1b21c16eabc53345", "value of D is not correct"
assert sha1(str(D).encode("utf-8")+b"3e0a2").hexdigest() == "20b157bcc96217e3e2c49c4d1b21c16eabc53345", "correct string value of D but incorrect case of letters"

assert sha1(str(type(E)).encode("utf-8")+b"3e0a3").hexdigest() == "ef7859b290186a908ba3a4a74c85f0f0c0aca39d", "type of E is not str. E should be an str"
assert sha1(str(len(E)).encode("utf-8")+b"3e0a3").hexdigest() == "f1cd04a223ed408145f767838868eada4dfc79d5", "length of E is not correct"
assert sha1(str(E.lower()).encode("utf-8")+b"3e0a3").hexdigest() == "0bcea3665ed82aeae1691bcfe611b1a3b2f83089", "value of E is not correct"
assert sha1(str(E).encode("utf-8")+b"3e0a3").hexdigest() == "0bcea3665ed82aeae1691bcfe611b1a3b2f83089", "correct string value of E but incorrect case of letters"

assert sha1(str(type(F)).encode("utf-8")+b"3e0a4").hexdigest() == "a47d057fd84860f38f189cd5eade1d9be567cf3c", "type of F is not str. F should be an str"
assert sha1(str(len(F)).encode("utf-8")+b"3e0a4").hexdigest() == "8ef341880ed78b20252abab89ce1c827fed27b92", "length of F is not correct"
assert sha1(str(F.lower()).encode("utf-8")+b"3e0a4").hexdigest() == "e9024aa272e7e0d2239bf6e8064ef448da5bde90", "value of F is not correct"
assert sha1(str(F).encode("utf-8")+b"3e0a4").hexdigest() == "e9024aa272e7e0d2239bf6e8064ef448da5bde90", "correct string value of F but incorrect case of letters"

print('Success!')

**Question 3.4** 
<br> {points: 1}

Read the file `happiness_report.csv` in the `data` folder using the shortest relative path. **Hint:** preview the data using Jupyter (as discussed above) so you know which `pd.read_*` function and arguments to use.

*Assign the relative path (the string) to an object named* `happiness_report_path`, *and assign the output of the correct* `pd.read_*` *function you call to an object named* `happiness_report`. 

In [None]:
# happiness_report_path = "___"
# ___ = ___(happiness_report_path)

# your code here
raise NotImplementedError
happiness_report

In [None]:
from hashlib import sha1
assert sha1(str(type(happiness_report_path)).encode("utf-8")+b"e989b").hexdigest() == "90662c3046cda55663e136c5f5b16e0d592a1f1f", "type of type(happiness_report_path) is not correct"

assert sha1(str(type(happiness_report_path.split("/")[-2:])).encode("utf-8")+b"e989c").hexdigest() == "940da2a4943333aabee0ba2a9de6ea9fe2ed0153", "type of happiness_report_path.split(\"/\")[-2:] is not list. happiness_report_path.split(\"/\")[-2:] should be a list"
assert sha1(str(len(happiness_report_path.split("/")[-2:])).encode("utf-8")+b"e989c").hexdigest() == "e9a08a8ba8b2c1cf5a6fa4036a56928a40c70c6a", "length of happiness_report_path.split(\"/\")[-2:] is not correct"
assert sha1(str(sorted(map(str, happiness_report_path.split("/")[-2:]))).encode("utf-8")+b"e989c").hexdigest() == "382fb1335c381db60680673e4eadfda74ad7a7f4", "values of happiness_report_path.split(\"/\")[-2:] are not correct"
assert sha1(str(happiness_report_path.split("/")[-2:]).encode("utf-8")+b"e989c").hexdigest() == "382fb1335c381db60680673e4eadfda74ad7a7f4", "order of elements of happiness_report_path.split(\"/\")[-2:] is not correct"

assert sha1(str(type(happiness_report)).encode("utf-8")+b"e989d").hexdigest() == "216bd42f45633e913cbe7de02e1fc596951fa8ce", "type of type(happiness_report) is not correct"

assert sha1(str(type(happiness_report.shape)).encode("utf-8")+b"e989e").hexdigest() == "b3d02fe17c06250d01b98a0d724f56b247c72249", "type of happiness_report.shape is not tuple. happiness_report.shape should be a tuple"
assert sha1(str(len(happiness_report.shape)).encode("utf-8")+b"e989e").hexdigest() == "3ccf50aa5ba471a44443c610855c942696be9be6", "length of happiness_report.shape is not correct"
assert sha1(str(sorted(map(str, happiness_report.shape))).encode("utf-8")+b"e989e").hexdigest() == "4a4bd4293538b446d2f265cf1c5a7fc826c186c6", "values of happiness_report.shape are not correct"
assert sha1(str(happiness_report.shape).encode("utf-8")+b"e989e").hexdigest() == "655583cc7219f7241adc53099adccb34e563b038", "order of elements of happiness_report.shape is not correct"

assert sha1(str(type(happiness_report.columns.values)).encode("utf-8")+b"e989f").hexdigest() == "8a412877db2cb764f1d24544a2096d20568be0ff", "type of happiness_report.columns.values is not correct"
assert sha1(str(happiness_report.columns.values).encode("utf-8")+b"e989f").hexdigest() == "4b3571ed2ac05db4ec7b6ef46b045f1a41f45122", "value of happiness_report.columns.values is not correct"

assert sha1(str(type(sum(happiness_report.freedom))).encode("utf-8")+b"e98a0").hexdigest() == "6a7b1d2093d089e5bf63b26803866f5fef531aab", "type of sum(happiness_report.freedom) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(happiness_report.freedom), 2)).encode("utf-8")+b"e98a0").hexdigest() == "fa4f82fb287a4f5dbdd2e04546f25f7e537f358b", "value of sum(happiness_report.freedom) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 3.5** Multiple Choice:
<br> {points: 1}

If Norway is in "first place" based on the happiness score, at what position is Canada?

A. 3rd

B. 15th

C. 7th

D. 28th

*Hint: create a new cell and run `happiness_report`.* 

*Assign your answer to an object called `answer3_5`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).* 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer3_5)).encode("utf-8")+b"12fff").hexdigest() == "640ba7ac35b03b528ea7c6427f1e2df75ebfeec1", "type of answer3_5 is not str. answer3_5 should be an str"
assert sha1(str(len(answer3_5)).encode("utf-8")+b"12fff").hexdigest() == "7f80788d24744a7970ee39276fad8e234872db74", "length of answer3_5 is not correct"
assert sha1(str(answer3_5.lower()).encode("utf-8")+b"12fff").hexdigest() == "e861046f576fad9264705ab096a8463cd931d1f1", "value of answer3_5 is not correct"
assert sha1(str(answer3_5).encode("utf-8")+b"12fff").hexdigest() == "171c15b779f86a061943e21e4ee9c1d343a163d8", "correct string value of answer3_5 but incorrect case of letters"

print('Success!')

**Question 3.6.1**
<br> {points: 1}

For each question in the ranges 3.6.1 to 3.6.5 and 3.7.1 to 3.7.2, fill in the blank (`___`) in the code given and remove the `raise NotImplementedError` for the questions where you provide an answer. Refer to your table above and don't be afraid to ask for help. Remember you can use the `?` help operator to access documentation for a function (e.g. `?pd.read_csv`).

Read in the file `happiness_report_semicolon.csv` using `pd.read_csv` and name it `happy_semi_df`

In [None]:
# ___ = pd.read_csv("data/___", sep = "___")

# your code here
raise NotImplementedError
happy_semi_df

In [None]:
from hashlib import sha1
assert sha1(str(type(happy_semi_df)).encode("utf-8")+b"2189f").hexdigest() == "c451fbcc65456b510e7fe9ec04478e05a658d4ef", "type of type(happy_semi_df) is not correct"

assert sha1(str(type(happy_semi_df.shape)).encode("utf-8")+b"218a0").hexdigest() == "12b12afcdc5002cfe0cf41061a4a56591ff29e3d", "type of happy_semi_df.shape is not tuple. happy_semi_df.shape should be a tuple"
assert sha1(str(len(happy_semi_df.shape)).encode("utf-8")+b"218a0").hexdigest() == "bcc044c0060d27559a69671b0947a0177d1b2c1f", "length of happy_semi_df.shape is not correct"
assert sha1(str(sorted(map(str, happy_semi_df.shape))).encode("utf-8")+b"218a0").hexdigest() == "19c39d553c35b0d0511fc8d684c894739cf69bff", "values of happy_semi_df.shape are not correct"
assert sha1(str(happy_semi_df.shape).encode("utf-8")+b"218a0").hexdigest() == "68e5588e5e910a5e7da7d623cd660283830cf65e", "order of elements of happy_semi_df.shape is not correct"

assert sha1(str(type(happy_semi_df.columns.values)).encode("utf-8")+b"218a1").hexdigest() == "29620d9553fa08829fe32933b4e8fcfb10fdbf56", "type of happy_semi_df.columns.values is not correct"
assert sha1(str(happy_semi_df.columns.values).encode("utf-8")+b"218a1").hexdigest() == "ae49416f4c7ac1ba8cd28846708a999b774fbebb", "value of happy_semi_df.columns.values is not correct"

assert sha1(str(type(sum(np.array([st.replace(",", ".") for st in happy_semi_df.freedom.astype(str)]).astype(float)))).encode("utf-8")+b"218a2").hexdigest() == "922d5a34544ea829ca324c33cfecae03e2785fe5", "type of sum(np.array([st.replace(\",\", \".\") for st in happy_semi_df.freedom.astype(str)]).astype(float)) is not correct"
assert sha1(str(sum(np.array([st.replace(",", ".") for st in happy_semi_df.freedom.astype(str)]).astype(float))).encode("utf-8")+b"218a2").hexdigest() == "0f183d0e48702ff7692cfceea80175df5ec01e5f", "value of sum(np.array([st.replace(\",\", \".\") for st in happy_semi_df.freedom.astype(str)]).astype(float)) is not correct"

print('Success!')

Take a look at the `happiness_score`, `GDP_per_capita`, `life_expectancy`, and `freedom` columns. It looks odd that a comma (`,`) shows up as the decimal separator instead of a point (`.`). If you check the data types with `happy_semi_df.info()`, you will see that those columns have been read in as objects rather than as numbers, as we would have hoped! What happened?

If we look closer, we'll see that the decimal separator in this data is a *comma* (`,`) rather than a period, a convention that is common in some European countries. To read this data in correctly, we would need to add `decimal=','` inside `read_csv`.
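To see the effect of `decimal` in isolation, here is a minimal sketch. It does not use the worksheet's file: the two-row text and the column values below are made up for illustration, and `io.StringIO` simply lets `read_csv` treat a string as if it were a file.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same style as the semicolon file:
# semicolons separate the columns and commas mark the decimals.
sample_text = "country;freedom\nA;0,55\nB;0,62\n"

# Without `decimal`, the comma-decimal column is read as text
# (an object or string dtype, depending on the pandas version).
as_text = pd.read_csv(io.StringIO(sample_text), sep=";")

# With `decimal=","`, the same column is parsed as floating-point numbers.
as_numbers = pd.read_csv(io.StringIO(sample_text), sep=";", decimal=",")

print(as_text["freedom"].dtype)
print(as_numbers["freedom"].dtype)
```

Comparing the two printed data types shows why the `freedom` column could not be summed directly in the checks above without first replacing the commas.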

**Question 3.6.2** True or False:
<br> {points: 1}

Read the documentation of `read_csv`. The `delimiter` parameter is useful even if you are already using the `sep` parameter, since it allows you to specify additional separation characters.

*Assign your answer to an object called `answer3_6_2`. Make sure your answer is a boolean (e.g. `True` or `False`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer3_6_2)).encode("utf-8")+b"86d1b").hexdigest() == "6dea0f97a5e4b8d4b93d2c498f2319fcb69d37d5", "type of answer3_6_2 is not bool. answer3_6_2 should be a bool"
assert sha1(str(answer3_6_2).encode("utf-8")+b"86d1b").hexdigest() == "863fd75e68e91ced7bf11b225d6b8bb8c02b3c41", "boolean value of answer3_6_2 is not correct"

print('Success!')

**Question 3.6.3**
<br> {points: 1}

Read in the file `happiness_report.tsv` using the appropriate value for the `sep` parameter and name it `happy_tsv`.

In [None]:
# ___ = ___("___")

# your code here
raise NotImplementedError
happy_tsv

In [None]:
from hashlib import sha1
assert sha1(str(type(happy_tsv)).encode("utf-8")+b"664d3").hexdigest() == "d2e8afe46e31ca8b48a3861dd3430cc5e475e682", "type of type(happy_tsv) is not correct"

assert sha1(str(type(happy_tsv.shape)).encode("utf-8")+b"664d4").hexdigest() == "b0eabe81a070c2d9e0b44a8c7cab1a5a5d1494f2", "type of happy_tsv.shape is not tuple. happy_tsv.shape should be a tuple"
assert sha1(str(len(happy_tsv.shape)).encode("utf-8")+b"664d4").hexdigest() == "a2ca923953d062e692c848ecd9110e41da4dd956", "length of happy_tsv.shape is not correct"
assert sha1(str(sorted(map(str, happy_tsv.shape))).encode("utf-8")+b"664d4").hexdigest() == "d73c6012e4fb6498b4884460a8642711eef8ebb0", "values of happy_tsv.shape are not correct"
assert sha1(str(happy_tsv.shape).encode("utf-8")+b"664d4").hexdigest() == "e93016850f72aa6b2f4bc30f8b1bb206ce5dfedc", "order of elements of happy_tsv.shape is not correct"

assert sha1(str(type(happy_tsv.columns.values)).encode("utf-8")+b"664d5").hexdigest() == "4213bbad83b1eedb4c1737f68718119c8c0e8c93", "type of happy_tsv.columns.values is not correct"
assert sha1(str(happy_tsv.columns.values).encode("utf-8")+b"664d5").hexdigest() == "b36f5d1334eb34e37cc3125a64c1a72556796790", "value of happy_tsv.columns.values is not correct"

assert sha1(str(type(sum(happy_tsv.freedom))).encode("utf-8")+b"664d6").hexdigest() == "a60cb8301749f4a12fdf6d817737c44b6696e2a4", "type of sum(happy_tsv.freedom) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(happy_tsv.freedom), 2)).encode("utf-8")+b"664d6").hexdigest() == "c3c923c13528009a4cadaeff8fdf725e0a12e3bc", "value of sum(happy_tsv.freedom) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 3.6.4**
<br> {points: 1}

Read in the file `happiness_report_metadata.csv` using the appropriate function and parameters. Name it `happy_metadata`.

In [None]:
# ___ = ___(
#     "data/happiness_report_metadata.csv", skiprows=___
# )

# your code here
raise NotImplementedError
happy_metadata

In [None]:
from hashlib import sha1
assert sha1(str(type(happy_metadata)).encode("utf-8")+b"d0517").hexdigest() == "6736e7a1579436044166f53fef68820945341f6e", "type of type(happy_metadata) is not correct"

assert sha1(str(type(happy_metadata.shape)).encode("utf-8")+b"d0518").hexdigest() == "77f7d638f8ba0f96a9a796b00c15396241349f61", "type of happy_metadata.shape is not tuple. happy_metadata.shape should be a tuple"
assert sha1(str(len(happy_metadata.shape)).encode("utf-8")+b"d0518").hexdigest() == "fbcf63784792ce14112711b63ffd7c1db05effd9", "length of happy_metadata.shape is not correct"
assert sha1(str(sorted(map(str, happy_metadata.shape))).encode("utf-8")+b"d0518").hexdigest() == "44e585af394d48797704c771f723a93efaf54fda", "values of happy_metadata.shape are not correct"
assert sha1(str(happy_metadata.shape).encode("utf-8")+b"d0518").hexdigest() == "c0d3186c836dc3acaa811a21a4edd6e5d70bf18f", "order of elements of happy_metadata.shape is not correct"

assert sha1(str(type(happy_metadata.columns.values)).encode("utf-8")+b"d0519").hexdigest() == "5b5ddf2be635f9d56fa722a2e82625c9ace60e09", "type of happy_metadata.columns.values is not correct"
assert sha1(str(happy_metadata.columns.values).encode("utf-8")+b"d0519").hexdigest() == "988eda3540e1bb5e5997e721690eb3d249b15786", "value of happy_metadata.columns.values is not correct"

assert sha1(str(type(sum(happy_metadata.freedom))).encode("utf-8")+b"d051a").hexdigest() == "1c60123f0b04df1c691dd80beccb42849ac11853", "type of sum(happy_metadata.freedom) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(happy_metadata.freedom), 2)).encode("utf-8")+b"d051a").hexdigest() == "780cce5a60aed6f9f7968ef9d3c622327c413c07", "value of sum(happy_metadata.freedom) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 3.6.5**
<br> {points: 1}

Read in the file `happiness_report_no_header.csv` and name it `happy_header`. 
Note: If the argument `names` is a list, the values will be used as the names of the columns.
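As a small illustration of what `names` changes, the sketch below reads the same headerless text twice. The text and the column names `label` and `value` are hypothetical and unrelated to the worksheet's data.

```python
import io

import pandas as pd

# Hypothetical headerless data: the first line is already a data row.
headerless_text = "A,1.5\nB,2.5\n"

# Without `names`, pandas treats the first row as the column names,
# so one data row is lost.
without_names = pd.read_csv(io.StringIO(headerless_text))

# With `names`, every line is kept as data and the list supplies the headers.
with_names = pd.read_csv(io.StringIO(headerless_text), names=["label", "value"])

print(without_names.shape)
print(with_names.shape)
print(list(with_names.columns))
```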

In [None]:
# ___ = ___(
#     "___",
#     names=[
#         "country",
#         "happiness_score",
#         "GDP_per_capita",
#         "life_expectancy",
#         "freedom",
#     ],
# )

# your code here
raise NotImplementedError
happy_header

In [None]:
from hashlib import sha1
assert sha1(str(type(happy_header)).encode("utf-8")+b"1f2ff").hexdigest() == "239e5133dc30b9f5f5b02b8983c18c969add83e8", "type of type(happy_header) is not correct"

assert sha1(str(type(happy_header.shape)).encode("utf-8")+b"1f300").hexdigest() == "ca791f2afe59750143e1b455cb011584fcc0056a", "type of happy_header.shape is not tuple. happy_header.shape should be a tuple"
assert sha1(str(len(happy_header.shape)).encode("utf-8")+b"1f300").hexdigest() == "fcacafe570aa3b1f20b08f95f4690e4461a672e7", "length of happy_header.shape is not correct"
assert sha1(str(sorted(map(str, happy_header.shape))).encode("utf-8")+b"1f300").hexdigest() == "05aa81ce22b273f15fb3c602b6894e23726a5fa9", "values of happy_header.shape are not correct"
assert sha1(str(happy_header.shape).encode("utf-8")+b"1f300").hexdigest() == "6d3d9efbda6f801877c7dc4a98187fe8f99ed77f", "order of elements of happy_header.shape is not correct"

assert sha1(str(type(happy_header.columns.values)).encode("utf-8")+b"1f301").hexdigest() == "e2e4d738e4ad32442c29666d9371ebfde814b73d", "type of happy_header.columns.values is not correct"
assert sha1(str(happy_header.columns.values).encode("utf-8")+b"1f301").hexdigest() == "2f0e80537d4524156f95794fae1790c45d23f45e", "value of happy_header.columns.values is not correct"

assert sha1(str(type(sum(happy_header.freedom))).encode("utf-8")+b"1f302").hexdigest() == "4a8cbcdb373b79f3314cf7d348afa525e67a1ee6", "type of sum(happy_header.freedom) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(happy_header.freedom), 2)).encode("utf-8")+b"1f302").hexdigest() == "e39574c00a672cf7b98538008a2870cd13dfeb9e", "value of sum(happy_header.freedom) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 3.7**
<br> {points: 1}

Earlier, when you tried to open `happiness_report.xlsx` in Jupyter, you received an error message `(File Load Error ... is not UTF-8 encoded)`. This happens because Excel spreadsheet files are not stored in plain text, and so Jupyter can't open them with its default text viewing program. This makes them a bit harder to inspect before loading them into Python.

To inspect the data, we will just try to load `happiness_report.xlsx` using the most basic form of the appropriate `read_*` function, passing only the filename as an argument. Assign the output to a variable called `happy_xlsx`.

*Note: you can also try to examine `.xlsx` files with Microsoft Excel or Google Sheets before loading into Python.*

In [None]:
# ___ = ___("___")

# your code here
raise NotImplementedError
happy_xlsx

In [None]:
from hashlib import sha1
assert sha1(str(type(happy_xlsx)).encode("utf-8")+b"73402").hexdigest() == "cb9aebeba5dcb422fbe7d3f60584bd1156756ead", "type of type(happy_xlsx) is not correct"

assert sha1(str(type(happy_xlsx.shape)).encode("utf-8")+b"73403").hexdigest() == "3b4b8a6ef9c874df90c6ed29ea968051787df0e1", "type of happy_xlsx.shape is not tuple. happy_xlsx.shape should be a tuple"
assert sha1(str(len(happy_xlsx.shape)).encode("utf-8")+b"73403").hexdigest() == "ee1dfcec3c009fd4f2c7d368a5a04c4bd635ea42", "length of happy_xlsx.shape is not correct"
assert sha1(str(sorted(map(str, happy_xlsx.shape))).encode("utf-8")+b"73403").hexdigest() == "3c62e078f906a023ec350516ff29f75c0d29bfde", "values of happy_xlsx.shape are not correct"
assert sha1(str(happy_xlsx.shape).encode("utf-8")+b"73403").hexdigest() == "5831b74a61d2c0b032064e56504c13d91613a3b6", "order of elements of happy_xlsx.shape is not correct"

assert sha1(str(type(happy_xlsx.columns.values)).encode("utf-8")+b"73404").hexdigest() == "88487d5709e5033e9cab4dced7d8157b1363282f", "type of happy_xlsx.columns.values is not correct"
assert sha1(str(happy_xlsx.columns.values).encode("utf-8")+b"73404").hexdigest() == "4cbc56b86fae9eebffecf0533b35b0ae2b9e1478", "value of happy_xlsx.columns.values is not correct"

assert sha1(str(type(sum(happy_xlsx.freedom))).encode("utf-8")+b"73405").hexdigest() == "017e2dbe908e88ec3fdcca0ba67dc8e6496de653", "type of sum(happy_xlsx.freedom) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(happy_xlsx.freedom), 2)).encode("utf-8")+b"73405").hexdigest() == "e3e16d66f1e945ad5b17a10dd7b5569afc2e6fe2", "value of sum(happy_xlsx.freedom) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 3.8** 
<br> {points: 1}

Opening the data files in a text editor showed some clear differences. Do all the data sets look the same once you have read them into your Python notebook (`"yes"` or `"no"`)?
 
*Assign your answer to an object called `answer3_8`. Make sure your answer is in lowercase and is surrounded by quotation marks (e.g. `"yes"` or `"no"`).* 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer3_8)).encode("utf-8")+b"1b91f").hexdigest() == "23e8e422b21e6cbb76d433c43327711418b388b8", "type of answer3_8 is not str. answer3_8 should be an str"
assert sha1(str(len(answer3_8)).encode("utf-8")+b"1b91f").hexdigest() == "a681e5f0f475e4b18dffce9937929c7528f03fdf", "length of answer3_8 is not correct"
assert sha1(str(answer3_8.lower()).encode("utf-8")+b"1b91f").hexdigest() == "761537528d75363932d7416d4d9fb7693d9ec637", "value of answer3_8 is not correct"
assert sha1(str(answer3_8).encode("utf-8")+b"1b91f").hexdigest() == "761537528d75363932d7416d4d9fb7693d9ec637", "correct string value of answer3_8 but incorrect case of letters"

print('Success!')

**Question 3.9** 
<br> {points: 1}

Using the `happy_header` data set that you read earlier, plot `life_expectancy` vs. `GDP_per_capita`. Note that the statement "plot A vs. B" usually means to plot A on the y-axis, and B on the x-axis. Be sure to give your axes human-readable titles.

*Assign your answer to an object called `header_plot`.* 

In [None]:
# your code here
raise NotImplementedError
header_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(header_plot.encoding.x['shorthand'])).encode("utf-8")+b"15eb8").hexdigest() == "14ae9edb99e3d055acaf4b4ea2ce491af1a529e0", "type of header_plot.encoding.x['shorthand'] is not str. header_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(header_plot.encoding.x['shorthand'])).encode("utf-8")+b"15eb8").hexdigest() == "44e3eda072e8e374aae0ed7d12eda2cb684875af", "length of header_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(header_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"15eb8").hexdigest() == "4b78678661341a6835965d137d01aac5b98b4abe", "value of header_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(header_plot.encoding.x['shorthand']).encode("utf-8")+b"15eb8").hexdigest() == "83d599acb9395cc4fa8d02a9c4ee54da71cc22c9", "correct string value of header_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(header_plot.encoding.y['shorthand'])).encode("utf-8")+b"15eb9").hexdigest() == "43121eaef816898c222dc730f50c90901497084d", "type of header_plot.encoding.y['shorthand'] is not str. header_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(header_plot.encoding.y['shorthand'])).encode("utf-8")+b"15eb9").hexdigest() == "9e9c1a282f9254b1c20a96c2154ea88bbf701943", "length of header_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(header_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"15eb9").hexdigest() == "15b5192c1bd72b1010dad6d9053ffd6aa9c95e14", "value of header_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(header_plot.encoding.y['shorthand']).encode("utf-8")+b"15eb9").hexdigest() == "15b5192c1bd72b1010dad6d9053ffd6aa9c95e14", "correct string value of header_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(header_plot.mark)).encode("utf-8")+b"15eba").hexdigest() == "024b07a5b8421a2ff6a14972e8d919564427a2c8", "type of header_plot.mark is not str. header_plot.mark should be an str"
assert sha1(str(len(header_plot.mark)).encode("utf-8")+b"15eba").hexdigest() == "8adf52de2c2c8402f905867e7c2e94ef3a793246", "length of header_plot.mark is not correct"
assert sha1(str(header_plot.mark.lower()).encode("utf-8")+b"15eba").hexdigest() == "8f9eb524983405c49ab8aa881bb1e2775ed684f1", "value of header_plot.mark is not correct"
assert sha1(str(header_plot.mark).encode("utf-8")+b"15eba").hexdigest() == "8f9eb524983405c49ab8aa881bb1e2775ed684f1", "correct string value of header_plot.mark but incorrect case of letters"

assert sha1(str(type(isinstance(header_plot.encoding.x['title'], str))).encode("utf-8")+b"15ebb").hexdigest() == "d63958b79bf8a82ddc7a187a0704d9d8ec6096e9", "type of isinstance(header_plot.encoding.x['title'], str) is not bool. isinstance(header_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(header_plot.encoding.x['title'], str)).encode("utf-8")+b"15ebb").hexdigest() == "b658e5d4deb4b826e31074560b1cfd9e98d2fd0c", "boolean value of isinstance(header_plot.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(header_plot.encoding.y['title'], str))).encode("utf-8")+b"15ebc").hexdigest() == "cea66b60c55fb85bf0bf893da9088b3f37e57bfc", "type of isinstance(header_plot.encoding.y['title'], str) is not bool. isinstance(header_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(header_plot.encoding.y['title'], str)).encode("utf-8")+b"15ebc").hexdigest() == "c68349f8ef019c399cf7ed49c1180b6e1a7c4e25", "boolean value of isinstance(header_plot.encoding.y['title'], str) is not correct"

print('Success!')

## 4. Reading Data from a Database

### Investigating the reliability of flights into and out of Boston Logan International Airport

Delays and cancellations seem to be an unavoidable risk of air travel. A missed connection, or hours spent waiting at the departure gate, might make you wonder though: how reliable is air travel, *really*?

The US Bureau of Transportation Statistics keeps a continually-updated [Airline On-Time Performance Dataset](https://transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data) that has tracked the scheduled and actual departure / arrival time of flights in the United States from 1987 to the present day. In this section we'll explore this data to try to answer that question. The data we'll be using covers only the year 2015, and was compiled into the [2015 Kaggle Flight Delays Dataset](https://www.kaggle.com/usdot/flight-delays) from the raw Bureau data. But even that dataset is too large to handle in this course (5.8 million flights in just one year!), so the data have been filtered down to flights that either depart from or arrive at Logan International Airport (`BOS`), resulting in around 209,000 flight records.

Our data has the following variables (columns):

- year
- month
- day
- day of the week (from 1 - 7.999..., with fractional days based on departure time)
- origin airport code
- destination airport code
- flight distance (miles)
- scheduled departure time (local)
- departure delay (minutes)
- scheduled arrival time (local)
- arrival delay (minutes)
- diverted? (True/False)
- cancelled? (True/False)




**Question 4.1** True or False:
<br> {points: 1}

We can use our dataset to figure out which airline company was the least likely to experience a flight delay in 2015.

*Assign your answer to an object called `answer4_1`. Make sure your answer is a boolean (e.g. `True` or `False`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer4_1)).encode("utf-8")+b"16442").hexdigest() == "48e3d1060d63cd176a8c7cf0feb68ec8a31219d6", "type of answer4_1 is not bool. answer4_1 should be a bool"
assert sha1(str(answer4_1).encode("utf-8")+b"16442").hexdigest() == "cf6ac610b8a2099176561e7b2752ab3620dd9877", "boolean value of answer4_1 is not correct"

print('Success!')

**Question 4.2** Multiple Choice
<br> {points: 1}

If we're mostly concerned with getting to our destination on time, which variable in our dataset should we use as the y-axis of a plot?

A. flight distance

B. departure delay

C. origin airport code

D. arrival delay

*Assign your answer as a single character to an object called `answer4_2`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer4_2)).encode("utf-8")+b"d4363").hexdigest() == "69917e3a03625a8bea9ae25d0eab9b8b8c6340cf", "type of answer4_2 is not str. answer4_2 should be an str"
assert sha1(str(len(answer4_2)).encode("utf-8")+b"d4363").hexdigest() == "f5358d747b3170deaef147894ad88b2ab37a6dc0", "length of answer4_2 is not correct"
assert sha1(str(answer4_2.lower()).encode("utf-8")+b"d4363").hexdigest() == "155d7003e10d4bc0b5cb7e6a4d55ce424a457868", "value of answer4_2 is not correct"
assert sha1(str(answer4_2).encode("utf-8")+b"d4363").hexdigest() == "2bb398e7e751be1920680368fc37318bdd3d0612", "correct string value of answer4_2 but incorrect case of letters"

print('Success!')

Let's start exploring our data. The file is stored in `data/flights_filtered.db` in your working directory (still the `py_worksheet_reading` folder). If you try to open the file in Jupyter to inspect its contents, you'll again run into the `File Load Error ... is not UTF-8 encoded` message you got earlier when trying to open an Excel spreadsheet in Jupyter. This is because the file is a *database* (often denoted by the `.db` extension), and databases are usually not stored in plain text.

We'll need another Python package to help us handle this kind of data. In this course, we will work with the [`ibis` package](https://ibis-project.org/docs/3.2.0/).

Let's load that now.



In [None]:
# Run this cell before continuing.
import ibis

In order to open a database in Python, you need to take the following steps:

1. Connect to the database. For an SQLite database, we do that using the `connect` function from the `sqlite` backend in the `ibis` package. This command does not read in the data; it simply tells Python where the database is and opens up a communication channel that Python can use to send SQL commands to the database.
2. Check what tables (similar to pandas data frames or Excel spreadsheets) are in the database using the `list_tables` function.
3. Once you've picked a table, create a Python object for it using the `table` function from the `conn` object.
4. Interact with this table using familiar commands like `head` or `[]`, and don't forget to use `execute` to get back a pandas data frame.

The next few questions will walk you through this process.


**Question 4.3.1** 
<br> {points: 1}

Use the `connect` function from the `sqlite` backend in the `ibis` package to open and connect to the `flights_filtered.db` database in the `data` folder.

*Assign the output to a variable named `conn`*.

In [None]:
# conn = ibis.sqlite.connect("___")  #replace ___ with the database relative path
# 

# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(type(conn))).encode("utf-8")+b"f016a").hexdigest() == "4e2b15a607f38c13976bacec14b17efae4fef1e5", "type of type(conn) is not correct"
assert sha1(str(type(conn)).encode("utf-8")+b"f016a").hexdigest() == "48bd08728ae5330a3167673b49c26b33540051a8", "value of type(conn) is not correct"

assert sha1(str(type(conn.list_tables())).encode("utf-8")+b"f016b").hexdigest() == "4722e164a9785101340a17116f69566c339471c2", "type of conn.list_tables() is not list. conn.list_tables() should be a list"
assert sha1(str(len(conn.list_tables())).encode("utf-8")+b"f016b").hexdigest() == "d3853298a211dba07b2ac4d22e73f7b7a15b40f6", "length of conn.list_tables() is not correct"
assert sha1(str(sorted(map(str, conn.list_tables()))).encode("utf-8")+b"f016b").hexdigest() == "0870a35799e8d92d2de94341dfe695962d049a1b", "values of conn.list_tables() are not correct"
assert sha1(str(conn.list_tables()).encode("utf-8")+b"f016b").hexdigest() == "0870a35799e8d92d2de94341dfe695962d049a1b", "order of elements of conn.list_tables() is not correct"

print('Success!')

**Question 4.3.2**
<br> {points: 1}

Use the `list_tables` function on the `conn` object to see what tables the database contains.

*Make a new variable named `flights_table_name` that stores the name of the table with our data in it.*

In [None]:
# Use this cell to figure out how to answer the question.
# Call the list_tables function in this cell and take a look at the output.
# Once you've seen the output, insert the output string in the cell below as denoted.

# ___.list_tables() # replace ___ with the right object

In [None]:
# ___ = '___'

# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(flights_table_name)).encode("utf-8")+b"49261").hexdigest() == "f251b75486fdfc20173bef5cd907c01c5e159162", "type of flights_table_name is not str. flights_table_name should be an str"
assert sha1(str(len(flights_table_name)).encode("utf-8")+b"49261").hexdigest() == "7dc1eca705c6114900d6bd337eeb971842a47373", "length of flights_table_name is not correct"
assert sha1(str(flights_table_name.lower()).encode("utf-8")+b"49261").hexdigest() == "ce09dc88f0ea22d8f66bdfa6bd872eca2789cfb8", "value of flights_table_name is not correct"
assert sha1(str(flights_table_name).encode("utf-8")+b"49261").hexdigest() == "ce09dc88f0ea22d8f66bdfa6bd872eca2789cfb8", "correct string value of flights_table_name but incorrect case of letters"

print('Success!')

**Question 4.3.3**
<br> {points: 1}

Use the `table` function from the `conn` object to create a Python reference to the table.

*Assign the reference to a new variable named `flight_data`.*

In [None]:
# flight_data = conn.table(_____)

# your code here
raise NotImplementedError
flight_data 

In [None]:
from hashlib import sha1
assert sha1(str(type(type(flight_data))).encode("utf-8")+b"5b4df").hexdigest() == "db872784af29d918b6d5279c1d01141997fbfdf9", "type of type(flight_data) is not correct"
assert sha1(str(type(flight_data)).encode("utf-8")+b"5b4df").hexdigest() == "6affb78734c68e7e7320d9d0da94eb6202730428", "value of type(flight_data) is not correct"

assert sha1(str(type(flight_data.columns)).encode("utf-8")+b"5b4e0").hexdigest() == "1dd6955c81fdfac3ffae551b7bbff30b008e9653", "type of flight_data.columns is not list. flight_data.columns should be a list"
assert sha1(str(len(flight_data.columns)).encode("utf-8")+b"5b4e0").hexdigest() == "5bd070e0b5ee9cd04f6b6ca3586c4ac83e74e331", "length of flight_data.columns is not correct"
assert sha1(str(sorted(map(str, flight_data.columns))).encode("utf-8")+b"5b4e0").hexdigest() == "e7b6677ca9c3510a89d14d22742127f7a27b5bbb", "values of flight_data.columns are not correct"
assert sha1(str(flight_data.columns).encode("utf-8")+b"5b4e0").hexdigest() == "873c4a97802fed55c713957c421183036e25b5d5", "order of elements of flight_data.columns is not correct"

print('Success!')

Now that we've connected to the database and created a reference to the table, we'll take a look at the first few rows and columns of the flight on-time performance data. So let's try using the `head` function (which allows us to see the first few rows of a dataset) and see what happens:

In [None]:
# Run this cell before continuing.
flight_data.head()

Although it looks like we might have obtained the whole data frame from the database, we didn't!
It's a *reference*; the data is still stored only in the SQLite database. The `flight_data` object
is an `AlchemyTable` (`ibis` is using `sqlalchemy` under the hood!), which, when printed, tells
you which columns are available in the table.

When we write `flight_data.head().execute()` in Python, in the background, the `execute` function is
translating the Python code into SQL, sending that SQL to the database, and then translating the
response for us.
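To give a feel for what "translating into SQL" means, the sketch below writes such a query by hand. It deliberately avoids `ibis` and uses Python's built-in `sqlite3` module together with `pd.read_sql_query`; the in-memory database, the table name `toy_flights`, and its rows are all made up for illustration, and the SQL that `ibis` actually generates for `head()` may look different.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory SQLite database (hypothetical table and rows).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE toy_flights (origin TEXT, delay_min REAL)")
rows = [
    ("BOS", 5.0),
    ("JFK", -3.0),
    ("BOS", 12.0),
    ("LAX", 0.0),
    ("BOS", 47.0),
    ("SFO", 8.0),
    ("BOS", -1.0),
]
connection.executemany("INSERT INTO toy_flights VALUES (?, ?)", rows)

# A hand-written query asking the database for only the first five rows.
# The database does the work; Python receives just the result.
first_rows = pd.read_sql_query("SELECT * FROM toy_flights LIMIT 5", connection)
connection.close()

print(first_rows)
```

The result is an ordinary pandas data frame with five rows, even though the table holds seven; only the rows requested by the query ever leave the database.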

In [None]:
# Run this cell before continuing.
flight_data.head().execute()

It works! And---as luck would have it---it also works to use the `[]` and `loc[]` functions you've learned about previously. Just don't forget to add `execute`! 

**Question 4.4**
<br> {points: 1}

Use `[]` to extract the **arrival and departure delay** columns for rows where **the origin airport is BOS.** This is done in two steps: first filter the rows, then select the columns.

*Store your answer in a variable called* `delay_data`.

In [None]:
# fd_bos_origin = flight_data[___ == "___"].execute()
# delay_data = fd_bos_origin[[___, ___]]

# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(delay_data.shape)).encode("utf-8")+b"13444").hexdigest() == "695d1b5ee603d0f2406935754b408443ca3dace8", "type of delay_data.shape is not tuple. delay_data.shape should be a tuple"
assert sha1(str(len(delay_data.shape)).encode("utf-8")+b"13444").hexdigest() == "1c8ea23931030012973f3c43023eaec3f88c147c", "length of delay_data.shape is not correct"
assert sha1(str(sorted(map(str, delay_data.shape))).encode("utf-8")+b"13444").hexdigest() == "97a5efa0caa3882b26d200f8e0f7b49e8db21808", "values of delay_data.shape are not correct"
assert sha1(str(delay_data.shape).encode("utf-8")+b"13444").hexdigest() == "7373d952e329b5cd252814b5fca640c71d727b02", "order of elements of delay_data.shape is not correct"

assert sha1(str(type(delay_data.columns.values)).encode("utf-8")+b"13445").hexdigest() == "bed552741f7f4e4011a8bafb6a457016c6a65e98", "type of delay_data.columns.values is not correct"
assert sha1(str(delay_data.columns.values).encode("utf-8")+b"13445").hexdigest() == "b7dc00aa0fc7436993c044fea4b541892ed7831b", "value of delay_data.columns.values is not correct"

print('Success!')

In [None]:
# Take a look at `delay_data` to make sure it has the two columns we expect.
# Run this cell before continuing.
delay_data.head()

Our next task is to visualize our data to see whether there is a difference in delays for arrivals at and departures from `BOS`. But before we do that, let's figure out just how much data we're working with using the `shape` attribute.

In [None]:
# Run this cell before continuing.
delay_data.shape

Yikes---that's a lot of data! If we tried to do a scatter plot of these, we probably wouldn't be able to see anything useful; all the points would be mushed together. Let's try using a *histogram* instead. A histogram helps us visualize how a particular variable is distributed in a dataset. It does this by separating the data into *bins*, and then plotting vertical bars showing how many data points fell in each bin.
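The binning step can be sketched without any plotting at all. The example below uses `np.histogram` on a handful of made-up numbers with hand-picked bin edges; both are assumptions chosen only to keep the counts easy to check by eye.

```python
import numpy as np

# Hypothetical values and bin edges, chosen only for illustration.
values = np.array([1, 2, 2, 3, 5, 7, 8, 8, 9])
bin_edges = np.array([0, 2.5, 5, 7.5, 10])

# np.histogram counts how many values fall between each pair of edges.
counts, edges = np.histogram(values, bins=bin_edges)

# Each (left edge, right edge, count) triple corresponds to one bar.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left} to {right}: {count}")
```

A histogram is just these counts drawn as bars, one bar per bin, with the bar height equal to the count.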

For example, we could use a histogram to visualize the distribution of happiness scores across countries with `mark_bar`.

In [None]:
happiness_data = pd.read_csv(
    "data/happiness_report_metadata.csv", skiprows=2
)

alt.Chart(happiness_data).mark_bar().encode(
    x=alt.X("happiness_score").bin(),
    y="count()",
)

We'll use histograms to visualize the departure delay times and arrival delay times separately.

**Question 4.5**
<br> {points: 1}

Plot the **arrival** delay time data as a histogram. You will plot the delay (in hours) separated into 15-minute-wide bins on the x axis. The y axis will show the percentage of flights departing BOS that had that amount of delay during 2015.

The plotting code is provided below, but you need to finish the data wrangling part for the plot by replacing `___` with the correct code. Note that we want the delay in hours on the x axis, but the delay data in the dataset is in minutes. To create this new column, we use the `assign` function in pandas.
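If `assign` is new to you, here is a minimal sketch of how it adds a column computed from an existing one. The data frame `toy` and its columns `wait_min` and `wait_hr` are hypothetical and are not part of the flights data.

```python
import pandas as pd

# Hypothetical data frame with a duration recorded in minutes.
toy = pd.DataFrame({"wait_min": [30, 90, -15]})

# assign returns a new data frame with the extra column added;
# the keyword argument name becomes the new column's name.
toy = toy.assign(wait_hr=toy["wait_min"] / 60)

print(toy)
```

Because `assign` returns a new data frame rather than modifying the original in place, the result has to be stored back in a variable, as done above.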

*Assign the output of altair plot to an object called* `arrival_delay_plot`.

In [None]:
# Replace each ___ with the correct code.

# delay_data = delay_data.assign(
#     ARRIVAL_DELAY_hr=___
# )
#
# ___ = alt.Chart(delay_data).transform_calculate(
#     row_proportion=f"1 / {delay_data.shape[0]}"
# ).mark_bar().encode(
#     alt.X("ARRIVAL_DELAY_hr:Q")
#         .bin(step=0.25, extent=[-2, 5])
#         .title("Delay (hours)"),
#     alt.Y("sum(row_proportion):Q")
#         .axis(format="%")
#         .title("% of Flights")
# )

# your code here
raise NotImplementedError
arrival_delay_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(arrival_delay_plot.mark)).encode("utf-8")+b"5ac0").hexdigest() == "6a78f9ea4357b43ef527f1967ebccf4f04a9bccf", "type of arrival_delay_plot.mark is not str. arrival_delay_plot.mark should be an str"
assert sha1(str(len(arrival_delay_plot.mark)).encode("utf-8")+b"5ac0").hexdigest() == "9902d3a0b445e85853c1c110a6a73c60e409015c", "length of arrival_delay_plot.mark is not correct"
assert sha1(str(arrival_delay_plot.mark.lower()).encode("utf-8")+b"5ac0").hexdigest() == "5a62975c2b11af74f0ab38690b8bf71d726eb6d9", "value of arrival_delay_plot.mark is not correct"
assert sha1(str(arrival_delay_plot.mark).encode("utf-8")+b"5ac0").hexdigest() == "5a62975c2b11af74f0ab38690b8bf71d726eb6d9", "correct string value of arrival_delay_plot.mark but incorrect case of letters"

assert sha1(str(type(arrival_delay_plot.encoding.x['title'])).encode("utf-8")+b"5ac1").hexdigest() == "9e48c1ff65bd936247b977c76732f62c06f3fa49", "type of arrival_delay_plot.encoding.x['title'] is not str. arrival_delay_plot.encoding.x['title'] should be an str"
assert sha1(str(len(arrival_delay_plot.encoding.x['title'])).encode("utf-8")+b"5ac1").hexdigest() == "3f36ff4a44a27a41156a8ff68711a1205dce0c62", "length of arrival_delay_plot.encoding.x['title'] is not correct"
assert sha1(str(arrival_delay_plot.encoding.x['title'].lower()).encode("utf-8")+b"5ac1").hexdigest() == "129661a2ec5ea4749727af6707def7584fb3b5f5", "value of arrival_delay_plot.encoding.x['title'] is not correct"
assert sha1(str(arrival_delay_plot.encoding.x['title']).encode("utf-8")+b"5ac1").hexdigest() == "00d375f6d987e249c4e0ee5bb0e85532d79b6318", "correct string value of arrival_delay_plot.encoding.x['title'] but incorrect case of letters"

assert sha1(str(type(arrival_delay_plot.encoding.y['title'])).encode("utf-8")+b"5ac2").hexdigest() == "20b34b4c5bf6835b0bdde2c9326f39fe38f90ff4", "type of arrival_delay_plot.encoding.y['title'] is not str. arrival_delay_plot.encoding.y['title'] should be an str"
assert sha1(str(len(arrival_delay_plot.encoding.y['title'])).encode("utf-8")+b"5ac2").hexdigest() == "359ebead2fa9e1c28ff3e74d8e3c9cc6f02076b6", "length of arrival_delay_plot.encoding.y['title'] is not correct"
assert sha1(str(arrival_delay_plot.encoding.y['title'].lower()).encode("utf-8")+b"5ac2").hexdigest() == "d7e81736bb65aec2e8a5c0dc86afe60f0225c98b", "value of arrival_delay_plot.encoding.y['title'] is not correct"
assert sha1(str(arrival_delay_plot.encoding.y['title']).encode("utf-8")+b"5ac2").hexdigest() == "a45905c9d560a2940a8801dff3bbd3428e63d20c", "correct string value of arrival_delay_plot.encoding.y['title'] but incorrect case of letters"

assert sha1(str(type(arrival_delay_plot.encoding.x['shorthand'])).encode("utf-8")+b"5ac3").hexdigest() == "0093e11d129d477fbfb12a0f0b95ab01513f4a9c", "type of arrival_delay_plot.encoding.x['shorthand'] is not str. arrival_delay_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(arrival_delay_plot.encoding.x['shorthand'])).encode("utf-8")+b"5ac3").hexdigest() == "f28190811225d4acadfa293099c69eb5a8ac0109", "length of arrival_delay_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(arrival_delay_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"5ac3").hexdigest() == "00d0db78285a4b15b59245ae3265a20ae0b1093e", "value of arrival_delay_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(arrival_delay_plot.encoding.x['shorthand']).encode("utf-8")+b"5ac3").hexdigest() == "c2ecd1e3c8686ecac8641fe9738435ce590f5ee7", "correct string value of arrival_delay_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(arrival_delay_plot.encoding.y['shorthand'])).encode("utf-8")+b"5ac4").hexdigest() == "1a9d23c967a147922df77fd6e19c995c2c4d2a34", "type of arrival_delay_plot.encoding.y['shorthand'] is not str. arrival_delay_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(arrival_delay_plot.encoding.y['shorthand'])).encode("utf-8")+b"5ac4").hexdigest() == "6e39f1bde7a4099e1b2a924a3ae1570968b0d900", "length of arrival_delay_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(arrival_delay_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"5ac4").hexdigest() == "cc02476194f9695fcf0caa3b6d6762b8a2a84b1e", "value of arrival_delay_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(arrival_delay_plot.encoding.y['shorthand']).encode("utf-8")+b"5ac4").hexdigest() == "a69ad692a38dcc7ba859982a81cb0a06514ca2ba", "correct string value of arrival_delay_plot.encoding.y['shorthand'] but incorrect case of letters"

print('Success!')

**Question 4.6**
<br> {points: 1}

Plot the **departure** delay time data as a histogram with the same format as the previous plot. **Hint:** copy and paste your code from the previous block! The only thing that will change is the column from `delay_data` that you use for the x-axis.

*Assign the altair plot to an object called* `departure_delay_plot`.

In [None]:
# delay_data = delay_data.assign(
#     DEPARTURE_DELAY_hr = ___
# )
# departure_delay_plot = ___ 

# your code here
raise NotImplementedError
departure_delay_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(departure_delay_plot.mark)).encode("utf-8")+b"b693f").hexdigest() == "e22cb2b3001cf4a724148a13101b4f1d1a220245", "type of departure_delay_plot.mark is not str. departure_delay_plot.mark should be an str"
assert sha1(str(len(departure_delay_plot.mark)).encode("utf-8")+b"b693f").hexdigest() == "1018c8320c0276c6265542b9c4badbed40e05865", "length of departure_delay_plot.mark is not correct"
assert sha1(str(departure_delay_plot.mark.lower()).encode("utf-8")+b"b693f").hexdigest() == "736ac9e3b4ddaabca2dc701c66bc605dc20fcd3b", "value of departure_delay_plot.mark is not correct"
assert sha1(str(departure_delay_plot.mark).encode("utf-8")+b"b693f").hexdigest() == "736ac9e3b4ddaabca2dc701c66bc605dc20fcd3b", "correct string value of departure_delay_plot.mark but incorrect case of letters"

assert sha1(str(type(departure_delay_plot.encoding.x['title'])).encode("utf-8")+b"b6940").hexdigest() == "5ff7fc5c97feec4f115c4c2a5eac8bf6820a8515", "type of departure_delay_plot.encoding.x['title'] is not str. departure_delay_plot.encoding.x['title'] should be an str"
assert sha1(str(len(departure_delay_plot.encoding.x['title'])).encode("utf-8")+b"b6940").hexdigest() == "2ec63e87919e5fcf730bc49e4d2a9d17e819f39f", "length of departure_delay_plot.encoding.x['title'] is not correct"
assert sha1(str(departure_delay_plot.encoding.x['title'].lower()).encode("utf-8")+b"b6940").hexdigest() == "073a6a1869d77a63166aedc9a4bf73765814cadc", "value of departure_delay_plot.encoding.x['title'] is not correct"
assert sha1(str(departure_delay_plot.encoding.x['title']).encode("utf-8")+b"b6940").hexdigest() == "99ef23b31517d96fcd5ae93945de5b9413ae724d", "correct string value of departure_delay_plot.encoding.x['title'] but incorrect case of letters"

assert sha1(str(type(departure_delay_plot.encoding.y['title'])).encode("utf-8")+b"b6941").hexdigest() == "a25cc41c44b77341969d48a9367768a82338f742", "type of departure_delay_plot.encoding.y['title'] is not str. departure_delay_plot.encoding.y['title'] should be an str"
assert sha1(str(len(departure_delay_plot.encoding.y['title'])).encode("utf-8")+b"b6941").hexdigest() == "df0d9d69a983068c009e2d7da481f2528e618da5", "length of departure_delay_plot.encoding.y['title'] is not correct"
assert sha1(str(departure_delay_plot.encoding.y['title'].lower()).encode("utf-8")+b"b6941").hexdigest() == "1417fc203f6875be68704d2674c6bb090b75f564", "value of departure_delay_plot.encoding.y['title'] is not correct"
assert sha1(str(departure_delay_plot.encoding.y['title']).encode("utf-8")+b"b6941").hexdigest() == "8dba3dfaecbaea977d68af1adbbbb7dd7aeeb214", "correct string value of departure_delay_plot.encoding.y['title'] but incorrect case of letters"

assert sha1(str(type(departure_delay_plot.encoding.x['shorthand'])).encode("utf-8")+b"b6942").hexdigest() == "ac59372f86033a174399fe9bd607e5c70bec35a1", "type of departure_delay_plot.encoding.x['shorthand'] is not str. departure_delay_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(departure_delay_plot.encoding.x['shorthand'])).encode("utf-8")+b"b6942").hexdigest() == "089b42a13f61ac66acbbf9365cf7f72cb4736120", "length of departure_delay_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(departure_delay_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"b6942").hexdigest() == "38f467945dd67da257dc34f7f75a03fdc3a6b23d", "value of departure_delay_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(departure_delay_plot.encoding.x['shorthand']).encode("utf-8")+b"b6942").hexdigest() == "45990de46bce33ce898af8e26c3e7a63f02c04ec", "correct string value of departure_delay_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(departure_delay_plot.encoding.y['shorthand'])).encode("utf-8")+b"b6943").hexdigest() == "a699220585971997a00a2241995c52bc49e08081", "type of departure_delay_plot.encoding.y['shorthand'] is not str. departure_delay_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(departure_delay_plot.encoding.y['shorthand'])).encode("utf-8")+b"b6943").hexdigest() == "82e0ddc4ff3113bee8c2cc50e702118ad2dec4b6", "length of departure_delay_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(departure_delay_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"b6943").hexdigest() == "04d71e357b0f66e02ad91f50cea06eca7277f0fb", "value of departure_delay_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(departure_delay_plot.encoding.y['shorthand']).encode("utf-8")+b"b6943").hexdigest() == "f41d01610289127b8183978b1de2b6e78b1cf7c8", "correct string value of departure_delay_plot.encoding.y['shorthand'] but incorrect case of letters"

print('Success!')

**Question 4.7**
<br> {points: 1}

Look at the two plots you generated. Are departures from or arrivals to `BOS` more likely to be on time? Note that "on time" is defined as being at most 15 minutes ahead or behind the scheduled time.

*Hint: Remember that each bin is 15 min (0.25 hours) wide.*

_Assign your answer (either `"departures"` or `"arrivals"`) to an object called `answer4_7`._
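
If you want to cross-check what you read off a histogram, the "on time" definition can also be expressed numerically. The sketch below uses made-up delays in hours (not the worksheet data) and computes the fraction that fall within 15 minutes of the scheduled time.

```python
import pandas as pd

# Hypothetical delays in hours; not taken from the worksheet dataset.
delays_hr = pd.Series([-0.5, -0.1, 0.0, 0.2, 0.25, 1.0])

# "On time" here means at most 15 minutes (0.25 hours) early or late.
on_time = delays_hr.abs() <= 0.25
on_time_share = on_time.mean()  # the mean of booleans is the fraction that are True
print(on_time_share)
```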

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer4_7)).encode("utf-8")+b"c3d34").hexdigest() == "9ab9755738e3e4b5646c5357d5e0ed32ccd3328b", "type of answer4_7 is not str. answer4_7 should be an str"
assert sha1(str(len(answer4_7)).encode("utf-8")+b"c3d34").hexdigest() == "d563aa33f2aecea0f1d7bc21278a86669739b51c", "length of answer4_7 is not correct"
assert sha1(str(answer4_7.lower()).encode("utf-8")+b"c3d34").hexdigest() == "b46a5948200312f0fc13e9dcc8408d28c7e0320c", "value of answer4_7 is not correct"
assert sha1(str(answer4_7).encode("utf-8")+b"c3d34").hexdigest() == "b46a5948200312f0fc13e9dcc8408d28c7e0320c", "correct string value of answer4_7 but incorrect case of letters"

print('Success!')

**Question 4.8**
<br>{points: 1}

Use the `to_csv` method to write the dataframe to a file called `delay_data.csv`. Save the file in the `data/` folder and specify `index=False` to avoid including the numerical index in the file.

*Note: there are many possible ways to use `to_csv` to customize the output. Just use the defaults here!*
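
To see what `index=False` does, here is a small sketch on a made-up data frame. It writes to a temporary folder (an assumption made so that the example does not touch the worksheet's `data/` folder) and reads the file back.

```python
import os
import tempfile

import pandas as pd

# Made-up data frame for illustration only.
toy = pd.DataFrame({"flight": ["A1", "B2"], "delay_hr": [0.25, 1.5]})

# Write to a temporary folder so nothing in the worksheet's data/ folder is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    toy.to_csv(path, index=False)  # index=False leaves out the 0, 1, ... row labels
    round_trip = pd.read_csv(path)

print(round_trip.columns.tolist())
```

Without `index=False`, the file would gain an extra unnamed first column holding the row labels.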

In [None]:
# If you don't know how to call to_csv, use this cell to
# check the documentation by calling ?pd.DataFrame.to_csv

In [None]:
# pd.DataFrame.to_csv(___, ___)

# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(os.path.exists('data/delay_data.csv'))).encode("utf-8")+b"89e0").hexdigest() == "392caca74352c591bc6016018a9af09f9edc433a", "type of os.path.exists('data/delay_data.csv') is not bool. os.path.exists('data/delay_data.csv') should be a bool"
assert sha1(str(os.path.exists('data/delay_data.csv')).encode("utf-8")+b"89e0").hexdigest() == "0beb3da636226fa40eee8fafe0358a7bd7618b46", "boolean value of os.path.exists('data/delay_data.csv') is not correct"

assert sha1(str(type(pd.read_csv('data/delay_data.csv').sum())).encode("utf-8")+b"89e1").hexdigest() == "92540987bfa75944ca400b1823e34b0b363e35bc", "type of pd.read_csv('data/delay_data.csv').sum() is not correct"
assert sha1(str(pd.read_csv('data/delay_data.csv').sum()).encode("utf-8")+b"89e1").hexdigest() == "7eedf7a9ceb6dd23b90ae94cc27ab58dd0bfb842", "value of pd.read_csv('data/delay_data.csv').sum() is not correct"

print('Success!')

In [None]:
try:
    os.remove("data/delay_data.csv")
except:
    None

## 5 (Optional). Reading Data from the Internet

### How has the gross world product changed throughout history?


As defined on Wikipedia, the "Gross world product (GWP) is the combined gross national product of all the countries in the world." Living in our modern age with our roaring (sometimes up and sometimes down) economies, one might wonder how the world economy has changed over history. To answer this question we will scrape data from the [Wikipedia Gross world product page](https://en.wikipedia.org/wiki/Gross_world_product).

Your data set will include the following columns: 
* `year`
* `gwp_value`

Specifically we will scrape the 2 columns named "Year" and "Real GWP" in the table under the header "Historical and prehistorical estimates". **The end goal of this exercise is to create a line plot with year on the x-axis and GWP value on the y-axis.**

**Question 5.1.0** Multiple Choice: 
<br> {points: 0}

Under which of the following headers is the table that we will scrape from the [Wikipedia Gross world product page](https://en.wikipedia.org/wiki/Gross_world_product)?

A. Gross world product

B. Recent growth

C. Historical and prehistorical estimates

D. See also

*Assign your answer to an object called `answer5_1_0`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).* 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer5_1_0)).encode("utf-8")+b"24f05").hexdigest() == "67099472f656e2c6f390dd68b40cbc57e2ba1605", "type of answer5_1_0 is not str. answer5_1_0 should be an str"
assert sha1(str(len(answer5_1_0)).encode("utf-8")+b"24f05").hexdigest() == "ba16b1f03083eb74b1c277ac0009a8d131717463", "length of answer5_1_0 is not correct"
assert sha1(str(answer5_1_0.lower()).encode("utf-8")+b"24f05").hexdigest() == "e21122aeb1b215d5fa87518542268ae60a7da041", "value of answer5_1_0 is not correct"
assert sha1(str(answer5_1_0).encode("utf-8")+b"24f05").hexdigest() == "680d138a68f1db1fd309b4f69993297dac4f3434", "correct string value of answer5_1_0 but incorrect case of letters"

print('Success!')

**Question 5.1.1** Multiple Choice: 
<br> {points: 0}

What is going to be on the x-axis of the line plot we create?

A. compound annual growth rate

B. the value of the gross world product

C. year

*Assign your answer to an object called `answer5_1_1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).* 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer5_1_1)).encode("utf-8")+b"bddc5").hexdigest() == "a0ee509767e053e1d662505411e83da62e945cd0", "type of answer5_1_1 is not str. answer5_1_1 should be an str"
assert sha1(str(len(answer5_1_1)).encode("utf-8")+b"bddc5").hexdigest() == "5bac244e4fff786bbec2bfcd2b0acd386630e5fc", "length of answer5_1_1 is not correct"
assert sha1(str(answer5_1_1.lower()).encode("utf-8")+b"bddc5").hexdigest() == "9126e40c031f381754e8f83fd451d1b3cdded764", "value of answer5_1_1 is not correct"
assert sha1(str(answer5_1_1).encode("utf-8")+b"bddc5").hexdigest() == "afe4c423df10a90e712d6c5826c99750b2e7dbda", "correct string value of answer5_1_1 but incorrect case of letters"

print('Success!')

We now need to load the `BeautifulSoup` and `requests` packages to begin our web scraping!

In [None]:
# Run this cell
import requests
from bs4 import BeautifulSoup

**Question 5.2**
<br> {points: 0}

To download information from the URL, we should create a `BeautifulSoup` object using the URL given in the cell below. Use the `get` function from the `requests` package to fetch the page, and then pass the page content into the `BeautifulSoup` function.

*Assign your answer to an object called `gwp`.*
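
To get a feel for what a `BeautifulSoup` object lets you do, here is a minimal sketch that parses a tiny hand-written HTML string instead of a downloaded page. The HTML content and the variable names (`html`, `soup`, `first_column`) are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for a downloaded one (hypothetical content).
html = """
<table class="wikitable">
  <tbody>
    <tr><td>2000 CE</td><td>100</td></tr>
    <tr><td>5,000 BCE</td><td>1</td></tr>
  </tbody>
</table>
"""

# With a real page, the markup would be the text of the response returned by requests.get.
soup = BeautifulSoup(html, "html.parser")

# select takes a CSS selector and returns the matching nodes.
first_column = [cell.get_text() for cell in soup.select(".wikitable td:nth-child(1)")]
print(first_column)
```

The selector `td:nth-child(1)` matches the first cell of every row, which is how a single column can be pulled out of a table.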

In [None]:
url = "https://en.wikipedia.org/wiki/Gross_world_product"

# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(gwp.__module__ == 'bs4')).encode("utf-8")+b"12bf4").hexdigest() == "4537d87aeedc94d23c25bc588e7f34f6bfe4f5e3", "type of gwp.__module__ == 'bs4' is not bool. gwp.__module__ == 'bs4' should be a bool"
assert sha1(str(gwp.__module__ == 'bs4').encode("utf-8")+b"12bf4").hexdigest() == "1f391b3fbe68a92825be67bc47e8fd2d19225754", "boolean value of gwp.__module__ == 'bs4' is not correct"

assert sha1(str(type(len(gwp.find_all("table")))).encode("utf-8")+b"12bf5").hexdigest() == "6a8802dc2f757db97fac98c0c0f11db5355cb5a3", "type of len(gwp.find_all(\"table\")) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(len(gwp.find_all("table"))).encode("utf-8")+b"12bf5").hexdigest() == "2f2c844675507ed0d7e22eca53462ce397040139", "value of len(gwp.find_all(\"table\")) is not correct"

print('Success!')

**Question 5.3**

Run the cell below to create the first column of your data set (the year from the table under the "Historical and prehistorical estimates" header). The table cells are picked out using a CSS selector.

In [None]:
# Run this cell to create the first column for your data set.

year = pd.DataFrame(
    [
        row.get_text()  # get the text content of each selected cell
        for row in gwp.select(
            ".wikitable tbody:nth-child(1) td:nth-child(1)"
        )  # select the first cell of each row in the table
    ],
    columns=["year"],  # set the column name
)
year.head()

We can see that although we want numbers for the year, the data we scraped includes the characters `CE` and `\n` (a newline character). We will have to do some string manipulation and then convert the years from characters to numbers. 

First we use the `replace` method with `regex=True` to match the string `" CE\n"` wherever it appears and replace it with nothing (`""`):

In [None]:
year = year.replace(" CE\n", "", regex=True)
year.head()

When we print `year`, we can see that we removed `" CE\n"`, but we missed that the earliest years end in `" BCE\n"` instead! There are also commas (`","`) in the large BCE years that we will have to remove. We also need to put a `-` sign in front of the BCE numbers so we don't confuse them with the CE numbers after we convert everything to numbers. We will use a similar strategy to clean all of this up!

This week we will provide you with the code to do this cleaning; next week you will learn to do these kinds of things yourself. After we do all the string/text manipulation, we use the `astype(int)` method to convert the text to integers.
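
For comparison, the same cleaning idea can be sketched on a made-up `Series` using the `.str` methods. This is an alternative illustration under stated assumptions (the values in `raw` are invented), not the code the worksheet uses.

```python
import pandas as pd

raw = pd.Series(["2000 CE\n", "1 CE\n", "5,000 BCE\n"])  # made-up values

is_bce = raw.str.contains("BCE")                     # remember which years are BCE
digits = raw.str.replace(r"[^0-9]", "", regex=True)  # keep only the digits
numbers = digits.astype(int)                         # convert text to integers

# Keep the number where the year is CE, otherwise use its negative.
years = numbers.where(~is_bce, -numbers)
print(years.tolist())
```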

In [None]:
# select the rows containing " BCE\n" and put a - at the beginning of them.

year[year["year"].str.contains(" BCE\n", regex=True)] = "-" + year[
    year["year"].str.contains(" BCE\n", regex=True)
].replace(" BCE\n", "", regex=True)


year

In [None]:
# Replace all commas with nothing and convert the values to integers
year = year.replace(",", "", regex=True).astype(int)
year

**Question 5.4**
<br> {points: 0}

Create a new column for the gross world product (GWP) from the table we are scraping. Don't forget to work out the CSS selector needed to scrape the GWP values from the table. Assign your answer to an object called `gwp_value`.

Fill in the `___` in the cell below. 

Refer to **Question 5.3** and don't be afraid to ask for help. 

In [None]:
# ___ = pd.DataFrame(
#     [
#         row.get_text()
#         for row in gwp.select(___)
#     ],
#     columns=[___],
# )

# your code here
raise NotImplementedError
gwp_value.head()

In [None]:
from hashlib import sha1
assert sha1(str(type(gwp_value.dtypes)).encode("utf-8")+b"de6c").hexdigest() == "25ed46e8bd2a7f3decfd7fe84ab61ca9a0b80a1b", "type of gwp_value.dtypes is not correct"
assert sha1(str(gwp_value.dtypes).encode("utf-8")+b"de6c").hexdigest() == "42b6cefa6b1b194e5fe73b0e803796dabb922a6b", "value of gwp_value.dtypes is not correct"

print('Success!')

Again, looking at the output of `gwp_value.head()`, we see that we have some cleaning and type conversions to do. We need to remove the commas, the extraneous trailing information in the first 3 entries, and the `"\n"` character again. We provide the code to do this below:

In [None]:
# Run this cell to clean up the GWP data and convert it to a number.

# Replace all commas with nothing
gwp_value = gwp_value.replace(",", "", regex=True)

# Extract numerics and change strings to numeric.
gwp_value["gwp_value"] = gwp_value["gwp_value"].str.extract(r"([0-9.]+)").astype(float)

gwp_value.head()

**Question 5.5**
<br> {points: 0}

Use the `pandas` `concat` function to create a data frame named `gwp` with `year` and `gwp_value` as columns. The general form for combining data frames side by side with `concat` is as follows (`axis=1` means the concatenation is done horizontally):

```pd.concat([COLUMN1_NAME, COLUMN2_NAME, COLUMN3_NAME, ...], axis=1)```
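
Here is a minimal sketch of that general form on two made-up single-column data frames. The names `left`, `right`, and `combined` are hypothetical.

```python
import pandas as pd

# Two made-up single-column data frames.
left = pd.DataFrame({"letter": ["a", "b", "c"]})
right = pd.DataFrame({"number": [1, 2, 3]})

# axis=1 places the frames side by side, matching rows by index label.
combined = pd.concat([left, right], axis=1)
print(combined)
```

Because rows are matched by index label, both frames should share the same index (as they do here with the default 0, 1, 2).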


In [None]:
# ___ = pd.concat(___, axis=1)

# your code here
raise NotImplementedError
gwp.head()

In [None]:
from hashlib import sha1
assert sha1(str(type(gwp)).encode("utf-8")+b"cef6e").hexdigest() == "102c2ed720fae9b0ba46d944ed8f9c7b05cf8ea7", "type of type(gwp) is not correct"

assert sha1(str(type("year" in gwp.columns)).encode("utf-8")+b"cef6f").hexdigest() == "0ae5457110ce5d6b9dc5a040c9e59c9d6466c9cd", "type of \"year\" in gwp.columns is not bool. \"year\" in gwp.columns should be a bool"
assert sha1(str("year" in gwp.columns).encode("utf-8")+b"cef6f").hexdigest() == "feecdd2ba0a233f63e8435f30736e998b4256ada", "boolean value of \"year\" in gwp.columns is not correct"

assert sha1(str(type("gwp_value" in gwp.columns)).encode("utf-8")+b"cef70").hexdigest() == "a1ae66d310d733f91eb77e221a086ec4675160dc", "type of \"gwp_value\" in gwp.columns is not bool. \"gwp_value\" in gwp.columns should be a bool"
assert sha1(str("gwp_value" in gwp.columns).encode("utf-8")+b"cef70").hexdigest() == "b49c9f84a4a86e1a248a4eb043a6f0d47e564e3f", "boolean value of \"gwp_value\" in gwp.columns is not correct"

print('Success!')

One last piece of data transformation/wrangling we will do before we get to data visualization is to create another column called `sqrt_year`, which rescales the year values so that they will be more informative when we plot them (if you look at our year data, we have a lot of years in the recent past, and fewer and fewer as we go back in time). Often you can just transform the scale within `altair` (for example, see what we do with `gwp_value` later on), but the year value is tricky to scale because it contains negative values. So we first need to make everything positive, then take the square root, and then make the values that should be negative negative again! We provide the code to do this below.

To take the square root of the values in a data frame, we can use the `np.sqrt` function from the `numpy` package, and we can use the `np.where` function to choose between values based on a condition. Both require the `numpy` package to be imported first.
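
As a small sketch of this "signed square root" idea, the example below applies `np.where` and `np.sqrt` to a made-up array of years (the values are invented for illustration).

```python
import numpy as np

years = np.array([-1_000_000, -250_000, 0, 1600])  # made-up values

# Take the square root of the magnitude, then restore the sign where the year was negative.
sqrt_years = np.where(years < 0, -np.sqrt(np.abs(years)), np.sqrt(np.abs(years)))
print(sqrt_years)
```

`np.where(condition, a, b)` picks from `a` where the condition is true and from `b` elsewhere, so negative years stay negative after the transformation.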

In [None]:
import numpy as np

In [None]:
gwp = gwp.assign(
    sqrt_year=np.where(year < 0, np.sqrt(abs(year)) * -1, np.sqrt(abs(year)))
)
gwp.head()

**Question 5.6**
<br> {points: 0}

Create a line plot using the `gwp` data frame where `sqrt_year` is on the x-axis and `gwp_value` is on the y-axis. Name your plot object `gwp_historical`. To make a line plot instead of a scatter plot, you should use the `mark_line` function instead of the `mark_point` function. *Note that we provide the plot code to relabel the x-axis with the human-readable years instead of the transformed ones we plot.*

In [None]:
# ___ = alt.Chart(gwp).mark_line().encode(
#     x=alt.X("___")
#         .title("___")
#         .axis(values=[-1000, -750, -500, -250, -77.7, 0, 38.7]),
#     y=alt.Y("___")
#         .scale(type="log", base=10)
#         .title("Gross World Domestic Product ($ billions)")
# )

# your code here
raise NotImplementedError

gwp_historical

In [None]:
from hashlib import sha1
assert sha1(str(type(gwp_historical.mark)).encode("utf-8")+b"53ee1").hexdigest() == "96e59696c289b2d49160c0dc4e93d1a045186e1e", "type of gwp_historical.mark is not str. gwp_historical.mark should be an str"
assert sha1(str(len(gwp_historical.mark)).encode("utf-8")+b"53ee1").hexdigest() == "cb44ff88a29f1b9446c5f6bccff2af79337aa8c3", "length of gwp_historical.mark is not correct"
assert sha1(str(gwp_historical.mark.lower()).encode("utf-8")+b"53ee1").hexdigest() == "71159f244cdf36b931300d58bd3364c0191662d1", "value of gwp_historical.mark is not correct"
assert sha1(str(gwp_historical.mark).encode("utf-8")+b"53ee1").hexdigest() == "71159f244cdf36b931300d58bd3364c0191662d1", "correct string value of gwp_historical.mark but incorrect case of letters"

assert sha1(str(type(isinstance(gwp_historical.encoding.x['title'], str))).encode("utf-8")+b"53ee2").hexdigest() == "9e905e1f201a07f920de1f1bf9fc7ad20335f564", "type of isinstance(gwp_historical.encoding.x['title'], str) is not bool. isinstance(gwp_historical.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(gwp_historical.encoding.x['title'], str)).encode("utf-8")+b"53ee2").hexdigest() == "d5256f85118552054613cef9a1cf2f9cfdfc768e", "boolean value of isinstance(gwp_historical.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(gwp_historical.encoding.y['title'], str))).encode("utf-8")+b"53ee3").hexdigest() == "f7f95106a7cd3932ac84dcbfbe6db5a6618d4cc7", "type of isinstance(gwp_historical.encoding.y['title'], str) is not bool. isinstance(gwp_historical.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(gwp_historical.encoding.y['title'], str)).encode("utf-8")+b"53ee3").hexdigest() == "0001fb3b01f0b5cb28c898297110c9fd2a62ec48", "boolean value of isinstance(gwp_historical.encoding.y['title'], str) is not correct"

print('Success!')

**Question 5.7** 
<br> {points: 0}

Looking at the line plot, when does the Gross World Domestic Product first start to increase more rapidly (i.e., when does the slope of the line first change)?

A. roughly around year -1,000,000

B. roughly around year -250,000

C. roughly around year -5000

D. roughly around year 1500


*Assign your answer to an object called `answer5_7`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer5_7)).encode("utf-8")+b"4f98").hexdigest() == "ab9b7411b8a1cee4789637b0e68e1ac7ade1678f", "type of answer5_7 is not str. answer5_7 should be an str"
assert sha1(str(len(answer5_7)).encode("utf-8")+b"4f98").hexdigest() == "23768b402d25bf597871f08fd8d46e9855f13fc3", "length of answer5_7 is not correct"
assert sha1(str(answer5_7.lower()).encode("utf-8")+b"4f98").hexdigest() == "c1d3a049a71954c0cfba194ffddb8b2996714ec6", "value of answer5_7 is not correct"
assert sha1(str(answer5_7).encode("utf-8")+b"4f98").hexdigest() == "bc64da73de0d4f2403c30c66e1b8da32dcac4e7a", "correct string value of answer5_7 but incorrect case of letters"

print('Success!')

In [None]:
try:
    os.remove("data/delay_data.csv")
except:
    None