# MCL-DSCI-011 Programming in Python for Data Science 

# Assignment 7: Importing Files and the Coding Style Guide

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

Below are links to two videos that explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA) and,       
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).       

### Assignment Learning Goals:

By the end of the module, students are expected to:

- Describe what Python libraries are, as well as explain when and why they are useful.
- Identify where code can be improved concerning variable names, magic numbers, comments and whitespace.
- Write code that is human readable and follows the black style guide.
- Import files from other directories.
- Use [`pytest`](https://docs.pytest.org/en/stable/) to check a function's tests.
- When running [`pytest`](https://docs.pytest.org/en/stable/), explain how pytest finds the associated test functions.
- Explain how the Python debugger can help rectify your code.

This assignment covers [Module 7](https://prog-learn.mds.ubc.ca/en/module7) of the online course. You should complete this module before attempting this assignment.

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Substitute the `None` with your completed code and answers, then run the cell!

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct.

In [None]:
# Import libraries needed for this lab
import test_assignment7 as t
from hashlib import sha1

## 1.   Importing libraries   
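
As a quick refresher, Python has a few common import forms. The sketch below uses two standard-library modules (`statistics` and `math`) purely for illustration; they are not the libraries this section asks you to import, and the variable names are made up for the example.

```python
# Import a whole module under a shorter alias
import statistics as stats

# Import a whole module under its own name
import math

# Import a single function from a module
from math import sqrt

values = [4, 9, 16]

# Accessed through the alias
mean_value = stats.mean(values)

# Accessed through the full module name
largest_root = math.sqrt(max(values))

# Accessed directly, with no module prefix
smallest_root = sqrt(min(values))

print(mean_value, largest_root, smallest_root)
```

An alias keeps calls short when a library is used often, while `from ... import ...` is handy when only one or two functions are needed.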

**Question 1(a)** <br> {points: 1}  

Import the `pandas` library and name it `pd` in the worksheet environment. 

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1a(dir())

**Question 1(b)** <br> {points: 1}  

Import the Altair library into the worksheet environment. 

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1b(dir())

**Question 1(c)** <br> {points: 1}  

From the `numpy` library, import only the `arange()` function using the keyword `from`. 

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1c()

## 2. Working with other files  
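
Working with other files often means importing a function that lives in a separate `.py` file. The sketch below is a self-contained illustration under stated assumptions: it writes a small throwaway module (the hypothetical `greeting_demo.py`) into a temporary folder, tells Python where to find it, and imports a function from it. None of these names come from the assignment files.

```python
import sys
import tempfile
from pathlib import Path

# Create a throwaway folder containing a small module (hypothetical name)
module_dir = Path(tempfile.mkdtemp())
(module_dir / "greeting_demo.py").write_text(
    "def greet(name):\n"
    '    """Return a short greeting."""\n'
    '    return "Hello, " + name + "!"\n'
)

# A module in the working directory can usually be imported directly;
# for any other folder, Python must be told where to look
sys.path.append(str(module_dir))

from greeting_demo import greet

print(greet("Chopped"))  # prints: Hello, Chopped!
```

The file name without the `.py` extension becomes the module name used in the `import` statement.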

**Question 2(a)** <br> {points: 1}  

Load in the `chopped.csv` file from the data folder and save it as an object named `chopped`.

In [None]:
chopped = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_2a(chopped)

**Question 2(b)** <br> {points: 1}  

Import the function `sample_dataframe()` (which we created in Assignment 6) from `sampling.py`.

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2b(dir())

**Question 2(c)** <br> {points: 2}  

To refresh yourself on what the function `sample_dataframe()` does, inspect the function docstring.  

Which of the following is the correct way to inspect the docstring of the function `sample_dataframe()`?     
*Hint: Try it out yourself*

A) `?sample.sample_dataframe`

B) `?sample.sample_dataframe()` 

C) `?sample_dataframe`

D) `?sample_dataframe()`


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between "", assign the correct answer to an object called `answer2_c`.*


In [None]:
answer2_c = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
# Note that this test has been hidden intentionally.
# It will provide no feedback as to the correctness of your answers.
# Thus, it is up to you to decide if your answer is sufficiently correct.

In [None]:
# Use this cell to obtain the function docstring

**Question 2(d)** <br> {points: 1}  

Based on the docstring, which parameter is optional?       
Give the parameter name as a `str` and assign it to the object `answer2_d`. 

In [None]:
answer2_d = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2d(answer2_d)

**Question 2(e)** <br> {points: 1}  

Based on the docstring, which parameter accepts data types of `str`?      

Give the parameter name as a `str` and assign it to the object `answer2_e`. 

In [None]:
answer2_e = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2e(answer2_e)

**Question 2(f)** <br> {points: 1}  

Sample two rows from each season from the `chopped` dataframe using your function `sample_dataframe`.     

Save this in an object named `chopped_sample`.

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2f(chopped_sample)

## 3. Using Pytest

We have provided you with another file called `test_sampling.py` which contains multiple functions that test if our `sample_dataframe()` function is working properly. 
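
To make the idea of a test file concrete, here is a minimal sketch of what `pytest`-style tests can look like. The function `double()` and both tests are hypothetical and are not taken from `test_sampling.py`. By default, `pytest` looks in files named `test_*.py` or `*_test.py` and collects the functions whose names start with `test`.

```python
def double(number):
    """Return twice the value of `number` (hypothetical example function)."""
    return number * 2


# pytest would collect these because their names start with `test_`
def test_double_positive():
    assert double(4) == 8


def test_double_zero():
    assert double(0) == 0


# Calling the tests by hand: a passing test returns silently,
# while a failing one raises an AssertionError
test_double_positive()
test_double_zero()
```

Each test checks one expected behaviour with a plain `assert`, which is what lets `pytest` report passes and failures test by test.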

**Question 3(a)** <br> {points: 1}  

The tests for `sample_dataframe()` are located in a different file than the function which means we will need to import the function from our `sampling.py` file at the top of `test_sampling.py`. 

Open `test_sampling.py` and on line 2, write code to import the `sample_dataframe()` function. 


In [None]:
t.test_3a()

**Question 3(b)** <br>

We are going to do things a little differently here than in the lesson. 
Using `pytest` in a Jupyter notebook, we can check if all the tests in `test_sampling.py` pass using the code `!pytest test_sampling.py` in a code cell. 


Try it out in the cell below and answer the following multiple choice questions regarding the results. 

In [None]:
# Use this code chunk to check your tests on the file test_sampling.py using pytest


**Question 3(b-i)** <br> {points: 1}  

How many of the tests from `test_sampling.py` passed?      
*Assign the correct answer to an object called `tests_passed`.*

In [None]:
tests_passed = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_3bi(tests_passed)

**Question 3(b-ii)** <br> {points: 2}  

How many of the tests from `test_sampling.py` failed?      
*Assign the correct answer to an object called `tests_failed`.*

In [None]:
tests_failed = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
# Note that this test has been hidden intentionally.
# It will provide no feedback as to the correctness of your answers.
# Thus, it is up to you to decide if your answer is sufficiently correct.

**Question 3(b-iii)** <br> {points: 1}  

Name a test that did not pass.   
*Assign the correct answer to an object called `failed_name`.*

In [None]:
failed_name = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_3biii(failed_name)

## 4. Black and Flake8 Formatting
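
For orientation, here is a small sketch of the kind of layout issues a style checker reports. The function `scale()` and the variable names are hypothetical and unrelated to `sampling.py`. The flagged lines run correctly; with its default settings, flake8 would be expected to report them only because of their spacing.

```python
def scale(value, factor=1):
    """Multiply `value` by `factor` (hypothetical example function)."""
    return value * factor


# E222: multiple spaces after operator (two spaces follow the `=`)
total =  scale(3, 2)

# E251: unexpected spaces around keyword / parameter equals
total_keyword = scale(3, factor = 2)

# The same statements in the layout that black would produce
total_clean = scale(3, 2)
total_keyword_clean = scale(3, factor=2)

# Formatting changes how the code looks, not what it computes
print(total == total_clean, total_keyword == total_keyword_clean)
```

Automatic formatters fix spacing issues like these, but they do not rename variables or rewrite comments, so some style decisions remain manual.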

**Question 4(a)** <br>

Run Flake8 on our `sampling.py` file in the cell below and answer the questions that follow.


In [None]:
# Use this cell to run flake8

**Question 4(a-i)** <br> {points: 1}  

How many formatting issues did flake8 recognize in the `sampling.py` file?      
*Assign the correct answer to an object called `answer4_ai`.*

In [None]:
answer4_ai = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_4ai(answer4_ai)

**Question 4(a-ii)** <br> {points: 1}  

How many `W291 trailing whitespace` issues are there? (We will talk a little bit about trailing and leading white space in Module 8)       
*Assign the correct answer to an object called `answer4_aii`.*

In [None]:
answer4_aii = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_4aii(answer4_aii)

**Question 4(a-iii)** <br> {points: 1}  


Which of the following is the formatting issue that occurs on line 37?     


A) `E222 multiple spaces after operator`

B) `W293 blank line contains whitespace`

C) `W291 trailing whitespace`

D) `E251 unexpected spaces around keyword / parameter equals` 


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between "", assign the correct answer to an object called `answer4_aiii`.*


In [None]:
answer4_aiii = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
sha1(str(answer4_aiii.lower() + '17').encode('utf8')).hexdigest()

In [None]:
t.test_4aiii(answer4_aiii)

**Question 4(b)** <br> {points: 1}  

Run `black` on our `sampling.py` file in the cell below and answer the questions that follow.


In [None]:
# Use this cell to run black

Which code would you use in a Jupyter code cell to run black?

A) `black sampling.py`

B) `!black sampling.py` 

C) `sampling.black()`

D) `black.sampling()`


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between "", assign the correct answer to an object called `answer4_b`.*

In [None]:
answer4_b = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_4b(answer4_b)

**Question 4(c)** <br> {points: 2}  

Now that we have reformatted our `sampling.py` file, let's rerun flake8 just as we did before and see how many of our formatting issues have been fixed, then answer the question below. 

In [None]:
# Use this cell to run flake8

How many formatting issues remain when you re-run flake8 after formatting `sampling.py` with `black`?

*Assign the correct answer to an object called `answer4_c`.*

In [None]:
answer4_c = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
# Note that this test has been hidden intentionally.
# It will provide no feedback as to the correctness of your answers.
# Thus, it is up to you to decide if your answer is sufficiently correct.

## 5. Style Guide - Comments and Variable Names
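
As a reminder of what a "magic number" and a vague name look like in practice, the sketch below contrasts two versions of the same calculation. All names here are hypothetical and chosen only for illustration.

```python
# Harder to read: an unexplained number and names that say nothing
def f(x):
    return x / 3600


# Easier to read: the named constant documents what the number means
SECONDS_PER_HOUR = 3600


def seconds_to_hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / SECONDS_PER_HOUR


print(seconds_to_hours(7200))  # 2.0
```

Both functions return the same result; the second one simply needs fewer comments because the names carry the meaning.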

**Question 5(a)** <br> {points: 1}  

Which of the following names is most fitting for an object that contains a list of column names from a dataframe named `metals`? 

A) `metal_columns`

B) `columnsfrommetaldataframe`

C) `list`

D) `c_metals` 


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between "", assign the correct answer to an object called `answer5_a`.*


In [None]:
answer5_a = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5a(answer5_a)

**Question 5(b)** <br> {points: 1}  

Which of the following names is the best fit for an object containing a dataframe of different lightbulb types?

A) `LIGHTBULBS`

B) `dataframe_where_lightbulbs_data_stored`

C) `data`

D) `lightbulb_df` 


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between "", assign the correct answer to an object called `answer5_b`.*


In [None]:
answer5_b = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5b(answer5_b)

**Question 5(c)** <br> {points: 2}  

Which of the following is NOT a reasonable comment to include in your code?

A) `# Keep this line of code in, or the function will break mysteriously`

B) `# Rename columns to shorter column names`

C) `# This assigns all the values greater than 100 a value of 100.`

D) `# TODO: Fix this next part so it's more readable and doesn't include magic numbers` 


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between "", assign the correct answer to an object called `answer5_c`.*

In [None]:
answer5_c = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
# Note that this test has been hidden intentionally.
# It will provide no feedback as to the correctness of your answers.
# Thus, it is up to you to decide if your answer is sufficiently correct.

**Question 5(d)** <br> {points: 2}  

Below is a function that plots a histogram of a specified quantitative column.
We want you to identify the 4 poorly designed elements within this function, and rewrite/rename them to something that is more appropriate. 

Copy and paste the function into the cell that follows it and then make your desired changes.

*Hint: The function name does not need to be changed* 

In [None]:
import altair as alt


def column_histogram(data, column_name):
    """
    
    Given a dataframe, this function creates a histogram
    of the values from a specified column
    
    Parameters
    ----------
    data : pandas.core.frame.DataFrame
        The dataframe to filter
    column_name : str
        The column values to plot
        
    Returns
    -------
    altair.vegalite.v4.api.Chart 
        the plotted histogram
        
    Examples
    --------
    >>> column_histogram(chopped, "season")
    altair.vegalite.v4.api.Chart 
    """
    
    # This checks if the data variable is of type pd.dataframe
    if not isinstance(data, pd.DataFrame): 
        raise TypeError("The data argument is not of type DataFrame")   
    
    # This area is reserved for an exception which checks the column dtype of column_name, it could be useful
    
    cs = column_name + ":Q"
    
    # This makes a histogram and plots the values of column_name frequency 
    histogram_plot_of_column_name = alt.Chart(data).mark_bar().encode(
                                        alt.X( cs, bin=True),
                                              y='count()',
                                    )
    
    # This function now returns a histogram 
    return histogram_plot_of_column_name
    

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5d(column_histogram)

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel and clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  

## Attributions
- UBC's original STAT545 - [Stat545 by Jenny Bryan](https://stat545.com/)
- MDS DSCI 523 - Data Wrangling course - [MDS's GitHub website](https://ubc-mds.github.io/) 
- Chopped Dataset - [Kaggle](https://www.kaggle.com/jeffreybraun/chopped-10-years-of-episode-data)

## Module Debriefing

If this video is not showing up below, click on the cell and click the ▶ button in the toolbar above.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('hBGFNWtYoYw', width=854, height=480)