# Programming in Python for Data Science 

# Assignment 6: Functions Fundamentals and Best Practices

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

Below are links to 2 videos that explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA), and
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).

### Assignment Learning Goals:

By the end of the module, students are expected to:

- Evaluate the readability, complexity and performance of a function.
- Write docstrings for functions following the NumPy/SciPy format.
- Write comments within a function to improve readability.
- Write and design functions with default arguments.
- Explain the importance of scoping and environments in Python as they relate to functions.
- Formulate test cases to prove a function design specification.
- Use `assert` statements to formulate a test case to prove a function design specification.
- Use test-driven development principles to define a function that accepts parameters, returns values and passes all tests.
- Handle errors gracefully via exception handling.

This assignment covers [Module 6](https://prog-learn.mds.ubc.ca/en/module6) of the online course. You should complete this module before attempting this assignment.

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Replace the `None` and the `raise NotImplementedError # No Answer - remove if you provide an answer` with your completed code and answers, then run the cell.

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct. These questions are worth 2 points.

In [1]:
# Import libraries needed for this lab
import pandas as pd
import random
import test_assignment6 as t
import altair as alt
import string
import inspect
from hashlib import sha1

## 1.  Writing functions

Here we have the `astronauts.csv` data we have used in previous assignments.

In [2]:
data = pd.read_csv('data/astronauts.csv')
data.head()

Unnamed: 0,Name,Year,Group,Status,Birth Date,Birth Place,Gender,Alma Mater,Undergraduate Major,Graduate Major,Military Rank,Military Branch,Space Flights,Space Flight (hr),Space Walks,Space Walks (hr),Missions,Death Date,Death Mission
0,Joseph M. Acaba,2004,19.0,Active,5/17/1967,"Inglewood, CA",Male,University of California-Santa Barbara; Univer...,Geology,Geology,,,2,3307,2,13.0,"STS-119 (Discovery), ISS-31/32 (Soyuz)",,
1,James C. Adamson,1984,10.0,Retired,3/3/1946,"Warsaw, NY",Male,US Military Academy; Princeton University,Engineering,Aerospace Engineering,Colonel,US Army (Retired),2,334,0,0.0,"STS-28 (Columbia), STS-43 (Atlantis)",,
2,Thomas D. Akers,1987,12.0,Retired,5/20/1951,"St. Louis, MO",Male,University of Missouri-Rolla,Applied Mathematics,Applied Mathematics,Colonel,US Air Force (Retired),4,814,4,29.0,"STS-41 (Discovery), STS-49 (Endeavor), STS-61 ...",,
3,Buzz Aldrin,1963,3.0,Retired,1/20/1930,"Montclair, NJ",Male,US Military Academy; MIT,Mechanical Engineering,Astronautics,Colonel,US Air Force (Retired),2,289,2,8.0,"Gemini 12, Apollo 11",,
4,Andrew M. Allen,1987,12.0,Retired,8/4/1955,"Philadelphia, PA",Male,Villanova University; University of Florida,Mechanical Engineering,Business Administration,Lieutenant Colonel,US Marine Corps (Retired),3,906,0,0.0,"STS-46 (Atlantis), STS-62 (Columbia), STS-75 (...",,


We have some code that randomly samples a given number of rows from each group of a specified column.
In this case it randomly selects 2 astronauts from each possible `Status` (`Active`, `Deceased`, `Management`, `Retired`).

In [3]:
data = pd.read_csv('data/astronauts.csv')
df_grouped = data.groupby('Status')

sampled_df = None

for group, rows in df_grouped: 
    group_sampling =  df_grouped.get_group(group).sample(2)
    sampled_df = pd.concat([sampled_df, group_sampling])
    
sampled_df

Unnamed: 0,Name,Year,Group,Status,Birth Date,Birth Place,Gender,Alma Mater,Undergraduate Major,Graduate Major,Military Rank,Military Branch,Space Flights,Space Flight (hr),Space Walks,Space Walks (hr),Missions,Death Date,Death Mission
315,Peggy A. Whitson,1996,16.0,Active,2/9/1960,"Mt. Ayr, IA",Female,Iowa Wesleyan College; Rice University,Chemistry & Biology,Biochemistry,,,3,11698,7,46.0,"STS-111/113 (Endeavor), ISS-16 (Soyuz), ISS-50...",,
165,Gregory H. Johnson,1998,17.0,Active,5/12/1962,"London, England",Male,US Air Force Academy; Columbia University; Uni...,Aeronautical Engineering,Flight Structures Engineering; Business Admini...,Colonel,US Air Force (Retired),2,755,0,0.0,"STS-123 (Endeavor), STS-134 (Endeavor)",,
294,Stephen D. Thorne,1985,11.0,Deceased,2/11/1953,"Frankfurt, West Germany",Male,US Naval Academy,Engineering,,Lieutenant Commander,US Navy,0,0,0,0.0,,5/24/1986,
242,Alan G. Poindexter,1998,17.0,Deceased,11/5/1961,"Pasadena, CA",Male,Georgia Institute of Technology; US Naval Post...,Aerospace Engineering,Aeronautical Engineering,Captain,US Navy,2,669,0,0.0,"STS-122 (Atlantis), STS-131 (Discovery)",7/1/2012,
30,Charles F. Bolden Jr.,1980,9.0,Management,8/19/1946,"Columbia, SC",Male,US Naval Academy; University of Southern Calif...,Electrical Science,Systems Management,Major General,US Marine Corps (Retired),4,680,0,0.0,"STS-61C (Columbia), STS-31 (Discovery), STS-45...",,
47,Yvonne D. Cagle,1996,16.0,Management,4/24/1959,"West Point, NY",Female,San Francisco State University,Biochemistry,,Colonel,US Air Force,0,0,0,0.0,,,
12,Lee J. Archambault,1998,17.0,Retired,8/25/1960,"Oak Park, IL",Male,University of Illinois-Urbana,Aeronautical & Astronautical Engineering,Aeronautical & Astronautical Engineering,Colonel,US Air Force,2,639,0,0.0,"STS-117 (Atlantis), STS-119 (Discovery)",,
327,Alfred M. Worden,1966,5.0,Retired,2/7/1932,"Jackson, MI",Male,US Military Academy; University of Michigan,Military Science,Aeronautical & Astronautical Engineering,Colonel,US Air Force (Retired),1,295,1,0.5,Apollo 15,,


**Question 1(a)** <br> {points: 3} 

Use the code above to write a function named `sample_dataframe` that randomly samples N observations from each group of a specified column, for any dataframe.
The function should accept the following arguments:
- The dataframe (`data`)
- The name of the grouping column (`grouping_col`)
- The number of observations to sample from each group (`N`), with a default value of 1

We have provided you code that executes your function using the `astronauts.csv` dataframe, the grouping column `Group` and a value of 1 for the number of observations to sample. 
The output of this is saved in an object named `astro_grp_samp`.

_**DISCLAIMER:** We understand that one of the limitations of the following dataset is that it reflects binary sex categories._

_Notes:_
- *See this link on [`.sample()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) to learn more about how it samples a dataframe*
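
As a minimal sketch of per-group sampling, the example below draws one row per group from a small made-up dataframe. The names `toy`, `team` and `score` are hypothetical and are not part of the assignment data. Passing `random_state` is optional and only makes the draw repeatable.

```python
import pandas as pd

# Hypothetical toy data, not the astronauts dataset
toy = pd.DataFrame({
    "team": ["a", "a", "a", "b", "b", "c"],
    "score": [1, 2, 3, 4, 5, 6],
})

# Draw one row from each team; random_state makes the draw repeatable
one_per_team = toy.groupby("team").sample(n=1, random_state=0)

# Each team should appear exactly once in the sample
print(one_per_team)
```

Note that asking for more rows than a group contains (here, `n=2` for team `c`) raises a `ValueError` unless `replace=True` is also passed.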

In [20]:

def sample_dataframe(data, grouping_col, N=1):
    return data.groupby(grouping_col).sample(N)



astro_grp_samp = sample_dataframe(data, 'Group')

In [21]:
t.test_1a(sample_dataframe)

'Success'

**Question 1(b)** <br> {points: 3} 

Write a function named `df_filterer` that filters rows matching an exact value in the column of interest, selects specific columns and returns a dataframe.  

The function should accept the following arguments:
- The dataframe (`data`)
- The name of the column of interest (`interest_column`)
- The value to filter for (`value`)
- The desired columns to select, given as a list (`keep`)

Make sure that your function is returning the transformed dataframe. 
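
The filtering and column selection described above can be sketched on a small made-up dataframe. The names `toy`, `city`, `name` and `age` are hypothetical. This is one possible way to combine a row filter and a column selection, not the only one.

```python
import pandas as pd

# Hypothetical toy data, not the astronauts dataset
toy = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Kyoto"],
    "name": ["Ana", "Ben", "Cal", "Dee"],
    "age": [31, 45, 27, 52],
})

# Boolean mask: True where the column equals the target value exactly
mask = toy["city"] == "Oslo"

# .loc applies the row mask and the list of columns in one step
oslo_names = toy.loc[mask, ["name"]]
print(oslo_names)
```

Passing the columns as a list keeps the result a dataframe even when only one column is selected.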


In [41]:

def df_filterer(data, interest_column, value, keep):
    return data[data[interest_column] == value][keep]

kansas_ast = df_filterer(data,'Alma Mater','University of Kansas', ['Name'])
kansas_ast

Unnamed: 0,Name
93,Joe H. Engle


In [42]:
t.test_1b(df_filterer,data)

'Success'

## 2.  Writing Docstrings

**Question 2(a)** <br> {points: 1} 

Copy/paste your function from **Question 1(a)**, and then improve it by adding a docstring. 
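
For reference, here is a minimal sketch of the NumPy/SciPy docstring layout applied to a small, hypothetical function (`rescale` is made up for illustration and is not part of the assignment).

```python
import inspect


def rescale(values, factor=2):
    """
    Multiply every number in a list by a constant.

    Parameters
    ----------
    values : list of float
        The numbers to rescale
    factor : float, optional
        The multiplier (the default is 2)

    Returns
    -------
    list of float
        The rescaled numbers, in their original order

    Examples
    --------
    >>> rescale([1, 2, 3])
    [2, 4, 6]
    """
    return [v * factor for v in values]


# inspect.getdoc returns the docstring with its indentation cleaned up
print(inspect.getdoc(rescale))
```

The same text is what a user sees when calling `help(rescale)` or `rescale?` in a notebook.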

In [37]:
def sample_dataframe(data, grouping_col, N=1):
    '''
    Returns a random sample of a dataframe.
    
    Returns a random sample of a dataframe given a column to group by and
    a number of items to return (default: 1).
    
    Parameters
    ----------
    data : pandas.DataFrame
        The dataframe to sample
    grouping_col : mapping, function, label, or list of labels
        The column to group by
    N : int, optional
        Number of rows to sample from each group (the default is 1)
    
    Returns
    -------
    DataFrame
        DataFrame representing a sample of the original dataframe (data)
        
    Examples
    --------
    >>> sample_dataframe(data, 'Group')
    
    '''
    return data.groupby(grouping_col).sample(N)

In [38]:
t.test_2a(sample_dataframe)

'Success'

**Question 2(b)** <br> {points: 2} 

Copy/paste your function from **Question 1(b)**, and then improve it by adding a docstring. 

In [45]:
def df_filterer(data, interest_column, value, keep):
    '''
    Filters rows matching an exact value in the column of interest.
    
    Filters rows matching an exact value in the column of interest, 
    selects specific columns and returns a dataframe.
    
    Parameters
    ----------
    data : pandas.DataFrame
        The dataframe to filter
    interest_column : str
        The name of the column to filter on
    value : str
        The exact value to filter for
    keep : list of str
        The names of the columns to keep in the returned dataframe
    
    Returns
    -------
    DataFrame
        DataFrame representing a filtered version of the original dataframe (data)
        
    Examples
    --------
    >>> df_filterer(data,'Alma Mater','University of Kansas', ['Name'])
    
    '''
    return data[data[interest_column] == value][keep]

In [46]:
# check that the function exists
assert 'df_filterer' in globals(
), "Please make sure that your solution is named 'df_filterer'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

## 3. Function with Exceptions 

**Question 3(a)** <br> {points: 1} 

Copy/paste your function from **Question 1(b)**, and add exceptions to check that:

1. A dataframe is the type of object being passed into the `data` argument (*Hint: you may want to use [this function](https://python-reference.readthedocs.io/en/latest/docs/functions/isinstance.html)*)
1. `value` exists in the dataframe (`.tolist()` may come in handy here) 
1. `interest_column` exists in the dataframe
1. The elements in `keep` exist in the dataframe
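
A minimal sketch of this kind of input checking, using a hypothetical helper named `pick_columns` and made-up data, might look as follows. The exception types chosen here are an assumption, and other types may suit your function better.

```python
import pandas as pd


def pick_columns(data, keep):
    """Return only the columns listed in `keep` (hypothetical helper)."""
    # Check the type of the input before touching any dataframe attributes
    if not isinstance(data, pd.DataFrame):
        raise TypeError("data must be a pandas DataFrame")
    # Collect every requested column that is absent from the dataframe
    missing = [col for col in keep if col not in data.columns]
    if missing:
        raise KeyError(f"columns not found in data: {missing}")
    return data[keep]


toy = pd.DataFrame({"height": [1.0, 2.0], "width": [3.0, 4.0]})

# A missing column triggers the informative KeyError instead of a cryptic one
try:
    pick_columns(toy, ["height", "depth"])
except KeyError as err:
    print("Caught:", err)
```

Checking the type first matters: the later checks rely on `data.columns`, which only exists on a dataframe.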

In [79]:
def df_filterer(data, interest_column, value, keep):
    if not isinstance(data, pd.DataFrame):
        raise TypeError('data must be a dataframe')
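    # Check the column first, since the value check below depends on it
    if interest_column not in data.columns:
        raise Exception('interest_column does not exist in data')
    if value not in data[interest_column].tolist():
        raise ValueError('value does not exist in data')
    # Every requested column must be present in the dataframe
    if not all(col in data.columns for col in keep):
        raise Exception('keep does not exist in data')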
    
    return data[data[interest_column] == value][keep]

In [80]:
t.test_3a(df_filterer)

'Success'

**Question 3(b)** <br> {points: 1} 

Copy/paste your function from **Question 1(a)**, and add at least 3 useful exceptions of your choice.

In [84]:

def sample_dataframe(data, grouping_col, N=1):
    if not isinstance(data, pd.DataFrame):
        raise TypeError('data must be a dataframe')
    if not grouping_col in data.columns:
        raise Exception('grouping_col does not exist in data')
    if not isinstance(N, int):
        raise TypeError('N must be of type int')
    
    return data.groupby(grouping_col).sample(N)

astro_grp_samp = sample_dataframe(data, 'Group')

In [85]:
t.test_3b(sample_dataframe)

'Success'

## 4. Helper Data and Unit Tests

**Question 4(a)** <br> {points: 1} 

Write helper data for the function in **Question 1(a)** that will be useful to write unit tests. 
Name the dataframe `helper_data`.

Make sure your data has 5-20 rows and 3-10 columns.   

*(Remember that your function will need to group and sample from this data.)* 
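
As a rough illustration of what makes helper data convenient for testing, the sketch below builds a tiny made-up dataframe whose group sizes are known in advance. The name `toy_helper` and its columns are hypothetical, and it is deliberately not the `helper_data` object this question asks for.

```python
import pandas as pd

# Hypothetical helper data: two groups of three rows each, with mixed dtypes
toy_helper = pd.DataFrame({
    "colour": ["red", "red", "red", "blue", "blue", "blue"],
    "weight": [1.5, 2.0, 2.5, 3.0, 3.5, 4.0],
    "count": [1, 2, 3, 4, 5, 6],
})

# Because the group sizes are known, expected sample sizes are easy to compute
print(toy_helper.groupby("colour").size().to_dict())
```

Small, hand-written data like this makes it possible to state the expected result of a test before running the function.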



In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

set(list(helper_data.dtypes))

In [None]:
t.test_4a(helper_data)

**Question 4(b)** <br> {points: 2} 

Create a function named `test_sample_dataframe` which takes zero arguments. The function should contain the code to make the helper data from **Question 4(a)**. Also in this function, write 5 `assert` tests that check your function from **Question 1(a)** using the helper data that you made in **Question 4(a)**.

After writing your function, make sure to call it and see if your function outputs any assert messages.

Make sure to include a `return` statement in your function. Your function should not return any values.
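
The general shape of such a test function can be sketched with an unrelated, hypothetical function (`count_vowels` is made up for illustration). Each `assert` pairs a condition with a message that is shown only if the condition fails.

```python
def count_vowels(text):
    """Count the vowels (a, e, i, o, u) in a string, ignoring case."""
    return sum(ch in "aeiou" for ch in text.lower())


def test_count_vowels():
    # Each assert checks one expectation and explains what went wrong if it fails
    assert count_vowels("") == 0, "Empty string should contain no vowels"
    assert count_vowels("xyz") == 0, "Consonant-only string should contain no vowels"
    assert count_vowels("banana") == 3, "'banana' should contain 3 vowels"
    assert count_vowels("AEIOU") == 5, "Uppercase vowels should be counted"
    assert isinstance(count_vowels("hi"), int), "Result should be an integer"
    return


# Silent output means every assertion passed
test_count_vowels()
```

If any assertion fails, Python raises an `AssertionError` carrying the message, which points directly at the broken expectation.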

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

test_sample_dataframe()

In [None]:
# check that the function exists
assert 'test_sample_dataframe' in globals(
), "Please make sure that your solution is named 'test_sample_dataframe'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

## 5. Function Design

**Question  5(a)** <br> {points: 1} 

Below we have a function that takes zero arguments and returns the astronaut dataframe filtered to only include astronauts who have had at least 6000 hours of space flight.

_**DISCLAIMER:** We understand that one of the limitations of the following dataset is that it reflects binary sex categories._

In [None]:
def load_astronauts(): 
    """
    Reads in the astronaut data and filters it for space flight time
    of at least 6000 hours
    
    Returns
    -------
    pandas.core.frame.DataFrame
        The filtered astronaut dataframe 
    
    Examples
    --------
    >>> load_astronauts()
    """
    
    df = pd.read_csv('data/astronauts.csv')
    df = df[df['Space Flight (hr)'] >= 6000]
    return df

In [None]:
space = load_astronauts()
space

What is wrong with the function `load_astronauts()`?


A) It doesn't take any arguments which is not good function design. 

B) It's attempting to do too many things by reading in the data AND filtering on `Space Flight (hr)`

C) It contains side effects that could easily be removed. 

D) It limits the user to only use the function to obtain astronauts with a hard-coded amount of `'Space Flight (hr)'` time. 




*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between "", assign the correct answer to an object called `answer5_a`.*

In [None]:
answer5_a = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_5a(answer5_a)

**Question  5(b)** <br> {points: 2} 

Given the function above, write a new similar function named `astronauts_space_time` that takes in an argument and corrects for the issue you identified above. 

Remember to add a `docstring` in your function.

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
# check that the function exists
assert 'astronauts_space_time' in globals(
), "Please make sure that your solution is named 'astronauts_space_time'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question  5(c)** <br> {points: 1} 

The function `astronaut_full_service` reads in the astronaut dataframe and prints multiple calculations and returns a plot. 

In [None]:
def astronaut_full_service(status=None): 
    """
    Reads in the astronaut data and optionally filters it by status.
    It prints the mean space flight time, displays a histogram of the number of
    space flights for both genders combined, and returns a histogram faceted by gender.

    
    Parameters
    ----------
    status : str, optional
        The status of an astronaut: Active, Deceased, Retired, etc. (the default is None, which implies
        that no filtering occurs)
    
    Returns
    -------
    altair.vegalite.v4.api.FacetChart
        A histogram of the number of space flights, faceted by gender.
    
    Examples
    --------
    >>> astronaut_full_service('Retired')
    alt.FacetChart(...)
    """
    
    df = pd.read_csv('data/astronauts.csv')
    
    if status is not None: 
        df = df[df['Status'] == status]

    mean_flight =  df['Space Flight (hr)'].mean()
    
    print('Mean Space Flight Time:', mean_flight) 
    
    
    plot1 = alt.Chart(df).mark_bar(size=40, color = 'tomato').encode(
            alt.X('Space Flights:Q'),
            alt.Y('count()'))
    
    plot1.display()
    
    plot2 = alt.Chart(df).mark_bar(size=40).encode(
            alt.X('Space Flights:Q'),
            alt.Y('count()'),
            color=alt.Color('Gender',
                   scale=alt.Scale(
            domain=['Male', 'Female'],
            range=['Navy', 'tomato']))
    ).facet(alt.Column('Gender:N'))
 
    return plot2

In [None]:
astronaut_full_service('Deceased')

What is the primary issue with function `astronaut_full_service()`?


A) The arguments it accepts are too limited. Having more options for arguments will give the ability to produce plots with more versatility and better insights 

B) There is no way of obtaining the results of `mean_flight`  from this function. The user would need to write additional code to obtain it. 

C) It contains side effects that could easily be removed. 

D) The plots are calling in variables from the global environment. 



*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between "", assign the correct answer to an object called `answer5_c`.*

In [None]:
answer5_c = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_5c(answer5_c)

**Question  5(d)** <br> {points: 1} 

Given the function `astronaut_full_service`, write a new, similar function named `astronauts_stats` that corrects for the issue you identified above. 

Your function should return a single value. 

Make sure to include a `docstring` for your function.

Test it out with `status='Deceased'`. 

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5d(astronauts_stats)

**Question  5(e)** <br> {points: 1} 

We have a final function named `filter_astronauts` that takes the astronaut dataset as one of its arguments. It takes multiple arguments and returns a dictionary with 2 dataframes as values.


In [None]:
astronauts = pd.read_csv('data/astronauts.csv')

def filter_astronauts(df, military_rank, year_min, year_max): 
    """
    Filters the input dataframe based on military rank and the astronaut's entry year.

    
    Parameters
    ----------
    df : pandas.core.frame.DataFrame
        The dataframe to filter
    military_rank : str
        The astronaut's military rank if any. If "No Rank", filters for no military ranking. 
    year_min : int
        Astronaut entry year minimum 
    year_max : int
        Astronaut entry year maximum 
    
    Returns
    -------
    dict
        A dictionary containing the 2 dataframes  
    
    Examples
    --------
    >>> filter_astronauts(df, "No Rank", 1996, 2010)

    """
   
    if military_rank == "No Rank": 
        df_military = df[df['Military Rank'].isnull()]
    elif military_rank is None: 
        df_military = df
    else:
        df_military = df[df['Military Rank'] == military_rank]
    
    df_year = df[(df['Year'] >= year_min) & (df['Year'] <= year_max)]
    
    dataframe_dict = {'military_filtered': df_military, 'year_filtered': df_year}
   
    return dataframe_dict

In [None]:
filter_astronauts(astronauts, "No Rank", 1996, 2010 )

Why is the function `filter_astronauts()` not considered the best possible design?

A) It returns a dictionary which is not good function design. 

B) It's attempting to do too many things and it would be better to have 2 separate functions, one that returns each dataframe. 

C) It contains side effects that could easily be removed. 

D) It limits the user's ability to filter on specific columns. 


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between "", assign the correct answer to an object called `answer5_e`.*

In [None]:
answer5_e = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_5e(answer5_e)

**Question  5(f)** <br> {points: 1} 

Given the function above, solve the issue that you specified above by making 2 new functions named `filters_military_rank` and `filters_active_years`. 
Your new functions should take the applicable arguments from the function in **Question 5(e)**.

Make sure to include a `docstring` for each function.

Run your new functions using the same parameters as: 

`filter_astronauts(astronauts, "No Rank", 1996, 2010)`

Save your answers in objects named `astro_no_rank` and `astro_96_10`.

The returned items should be `dataframes`.

In [None]:
astronauts = pd.read_csv('data/astronauts.csv')

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5f(filters_military_rank,filters_active_years,astronauts)

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel and clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  

## Attributions
- MDS DSCI 511 - Programming for Data Science - [MDS's GitHub website](https://ubc-mds.github.io/course-descriptions/DSCI_511_prog-dsci/) 
- Astronaut Dataset - [Kaggle](https://www.kaggle.com/nasa/astronaut-yearbook?select=astronauts.csv)

## Module Debriefing

If this video is not showing up below, click on the cell and click the ▶ button in the toolbar above.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('3d5rOf1SEUY', width=854, height=480)