# Writing Functions in Python

## Introduction

This course is presented by Shayne Miel, Director of Software Engineering at American Efficient. Collaborators are Hillary Green-Lerman and Becca Robbins.

Prerequisite:
- Python Data Science Toolbox (Part 2)

This course is part of these tracks:
- Data Engineer with Python
- Data Scientist with Python
- Python Programmer
- Python Programming

There are no datasets for this course.

## Versions

The course's IDE uses Python 3.9.7 (default, Sep 10 2021, 00:03:59) \[GCC 7.5.0].

This notebook is being written using Python 3.11.1.

## Data Set

| File | Description |
| :--- | :----|
| alice.txt | The complete text of _Alice in Wonderland_ |

## Resources

### Docstrings
- [Python PEP 257 - Docstring Conventions](https://peps.python.org/pep-0257/)
- [reStructuredText Markup](https://devguide.python.org/documentation/markup/)
- [DataCamp Docstrings Tutorial](https://www.datacamp.com/tutorial/docstrings-python)
- [Numpy Style Guide](https://numpydoc.readthedocs.io/en/latest/format.html)
- [Google Style Guide for Comments and Docstrings](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings)

### inspect Module
- [inspect - Inspect live objects](https://docs.python.org/3/library/inspect.html)

## Imports

Imports are gathered here for clarity and convenience.

In [None]:
import contextlib
import inspect
import time

import numpy as np
import pandas as pd

## Best Practices

### Docstrings

#### Example Docstring (Demonstration)

```python
def split_and_stack(df, new_names):
    """
    Split a DataFrame's columns into two halves and then stack
    them vertically, returning a new DataFrame with 'new_names' as the
    column names.
    
    Args:
        df (DataFrame): The DataFrame to split.
        new_names (iterable of str): The column names for the new DataFrame
    
    Returns:
        DataFrame
    """
    half = int(len(df.columns) / 2)
    left = df.iloc[:, :half]
    right = df.iloc[:, half:]
    return pd.DataFrame(
        data=np.vstack([left.values, right.values]),
        columns=new_names
    )
```

#### Anatomy of a Docstring (Demonstration)

```python
def function_name(arguments):
    """
    Description of what the function does.
    
    Description of the arguments, if any.
    
    Description of the return values, if any.
    
    Description of errors raised, if any.
    
    Optional extra notes or examples of usage.
    """
```

There are four major docstring formats:
- Google Style
- Numpydoc
- reStructuredText
- Epytext

This course focuses on Google Style and Numpydoc.
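
For contrast with the two styles the course covers, here is a rough sketch of a docstring written with reStructuredText field lists, the form commonly used with Sphinx. The function `scale` and its parameters are hypothetical and are not part of the course.

```python
import inspect


def scale(value, factor=2):
    """
    Multiply a number by a scaling factor.

    :param value: The number to scale.
    :type value: float
    :param factor: The multiplier. Defaults to 2.
    :type factor: float, optional
    :returns: The scaled number.
    :rtype: float
    """
    return value * factor


# inspect.getdoc() removes the common leading indentation.
print(inspect.getdoc(scale))
```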

#### Google Style Docstrings (Demonstration)

Google style docstrings are used by this course because the format is more
compact.

```python
def function(arg_1, arg_2=42):
    """
    Imperative description of what the function does.
    
    Args:
        arg_1 (str): Description of arg_1 that can break into the next line
            if needed.
        arg_2 (int, optional): Write optional when an argument has a default
            value
    
    Returns:
        bool: Optional description of the return value
        Extra lines are not indented
    
    Raises:
        ValueError: Include any error types that the function intentionally
            raises
    
    Notes:
        See https://www.datacamp.com/tutorial/docstrings-python
        for more information.
    """
```

#### Numpydoc Docstrings (Demonstration)

Numpydoc docstrings are the most common in the scientific community.

```python
def function(arg_1, arg_2=42):
    """
    Imperative description of what the function does.
    
    Parameters
    ----------
    arg_1 : expected type of arg_1
        Description of arg_1
    arg_2 : int, optional
        Write optional when an argument has a default value.
        Default=42.
        
    Returns
    -------
    The type of the return value
        Can include a description of the return value.
        Replace "Returns" with "Yields" if this function is a generator.
    """
```

#### Retrieving Docstrings (Demonstration)

In [None]:
def the_answer():
    """
    Return the answer to life,
    the universe, and everything.

    Returns:
        int
    """
    return 42
print(the_answer.__doc__)
# Remove leading spaces.
print(inspect.getdoc(the_answer))

#### Crafting a Docstring (Exercise)

In [None]:
def count_letter(content, letter):
    """
    Count the number of times `letter` appears in `content`.

    Args:
        content (str): The string to search.
        letter (str): The letter to search for.
    
    Returns:
        int
    
    Raises:
        ValueError: If `letter` is not a one-character string.
    """
    if (not isinstance(letter, str)) or len(letter) != 1:
        raise ValueError('`letter` must be a single character string.')
    return len([char for char in content if char == letter])

#### Retrieving Docstrings (Exercise)

In [None]:
# Display the unprocessed docstring.
docstring = count_letter.__doc__
border = '#' * 28
print('{}\n{}\n{}'.format(border, docstring, border))

# Use inspect.getdoc to remove leading and trailing blank lines and leading
# white space from the docstring.
docstring = inspect.getdoc(count_letter)
border = '#' * 28
print('{}\n{}\n{}'.format(border, docstring, border))

def build_tooltip(function):
    """
    Create a tooltip for any function that shows the
    function's docstring.
    
    Args:
        function (callable): The function we want a tooltip for.
    
    Returns:
        str
    """
    # Get the docstring for the function argument by using inspect.
    docstring = inspect.getdoc(function)
    border = "#" * 28
    return "{}\n{}\n{}".format(border, docstring, border)

print(build_tooltip(count_letter))
print(build_tooltip(range))
print(build_tooltip(print))
print(build_tooltip(build_tooltip))
print(build_tooltip(inspect.getdoc))

#### Docstrings to the Rescue! (Exercise)

In [None]:
# This was an exercise in looking at docstrings.
print(np.histogram.__doc__)

### DRY and "Do One Thing"

#### Don't Repeat Yourself (Demonstration)

This code repeats itself, and it contains an error in the last code block (`### yikes! ###`). Code like this is difficult to maintain because any change in the algorithm must be made at three separate locations.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

# Analyze the training data.
train = pd.read_csv("train.csv")
train_y = train["labels"].values
train_X = train[[col for col in train.columns if col != "labels"]].values
train_pca = PCA(n_components=2).fit_transform(train_X)
plt.scatter(train_pca[:, 0], train_pca[:, 1])

# Analyze the validation data.
val = pd.read_csv("validation.csv")
val_y = val["labels"].values
val_X = val[[col for col in val.columns if col != "labels"]].values
val_pca = PCA(n_components=2).fit_transform(val_X)
plt.scatter(val_pca[:, 0], val_pca[:, 1])

# Analyze the test data.
test = pd.read_csv("test.csv")
test_y = test["labels"].values
test_X = test[[col for col in test.columns if col != "labels"]].values
test_pca = PCA(n_components=2).fit_transform(train_X) ### yikes! ###
plt.scatter(test_pca[:, 0], test_pca[:, 1])
```

This is where a function is useful for eliminating repeated code. The function
does the desired work and returns the x and y values for each dataset for
further use. (Note that we provide a well-formatted docstring for the
function.)

```python
def load_and_plot(path):
    """
    Load a dataset and plot the first two principal components.
    
    Args:
        path (str): The location of the CSV file.
    
    Returns:
        tuple of ndarray: (features, labels)
    """
    data = pd.read_csv(path)
    data_y = data["labels"].values
    data_X = data[[col for col in data.columns if col != "labels"]].values
    data_pca = PCA(n_components=2).fit_transform(data_X)
    plt.scatter(data_pca[:, 0], data_pca[:, 1])
    return data_X, data_y

train_X, train_y = load_and_plot("train.csv")
val_X, val_y = load_and_plot("validation.csv")
test_X, test_y = load_and_plot("test.csv")
```

At this point, the function violates another software engineering principle:
every function should do one thing. This function does three things:

1) it loads data
2) it transforms data
3) it plots data

Here, the course creates two functions that decouple data loading from data transformation and plotting.

```python
def load_data(path):
    """
    Load a dataset and return the x and y values.
    
    Args:
        path (str): The location of the CSV file.
    
    Returns:
        tuple of ndarray: (features, labels)
    """
    data = pd.read_csv(path)
    data_y = data["labels"].values
    data_X = data[[col for col in data.columns if col != "labels"]].values
    return data_X, data_y

def plot_data(data_X):
    """
    Plot the first two principal components of a matrix.
    
    Args:
        data_X (numpy.ndarray): The data to plot.
    """
    data_pca = PCA(n_components=2).fit_transform(data_X)
    plt.scatter(data_pca[:, 0], data_pca[:, 1])
```

When writing functions that do one thing, the code becomes:
- more flexible
- more easily understood
- simpler to test
- simpler to debug
- easier to change
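
To illustrate the "simpler to test" point, here is a minimal sketch of a loader that does only one thing. It is an assumption-laden stand-in, not the course's code: it uses the standard library `csv` module instead of pandas, and the in-memory sample data is invented for the example.

```python
import csv
import io


def load_data(csv_file):
    """
    Read rows from an open CSV file and split features from labels.

    Args:
        csv_file (file-like): An open text file with a "labels" column.

    Returns:
        tuple of list: (features, labels)
    """
    rows = list(csv.DictReader(csv_file))
    labels = [row["labels"] for row in rows]
    features = [
        [float(value) for name, value in row.items() if name != "labels"]
        for row in rows
    ]
    return features, labels


# Because load_data() only loads, it can be checked with an in-memory file
# and without any plotting library.
sample = io.StringIO("a,b,labels\n1,2,x\n3,4,y\n")
features, labels = load_data(sample)
print(features)  # [[1.0, 2.0], [3.0, 4.0]]
print(labels)    # ['x', 'y']
```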

Shayne Miel recommends reading _Refactoring: Improving the Design of Existing Code (2nd Edition)_ by Martin Fowler.

By the way, reading this code inspired me to look into how to do principal component analysis (PCA). See [Principal Component Analysis of Breast Cancer Dataset](../Principal%20Component%20Analysis%20in%20Python/Principal%20Component%20Analysis%20of%20Breast%20Cancer%20Dataset.ipynb).

#### Extract a Function (Exercise)

Create a function that standardizes the values in a column, and use it on four columns of a DataFrame.

```python
def standardize(column):
    """
    Standardize the values in a column.

    Args:
        column (pandas Series): The data to standardize.

    Returns:
        pandas Series: the values as z-scores
    """
    # Finish the function so that it returns the z-scores
    z_score = (column - column.mean()) / column.std()
    return z_score

# Use the standardize() function to calculate the z-scores
df['y1_z'] = standardize(df.y1_gpa)
df['y2_z'] = standardize(df.y2_gpa)
df['y3_z'] = standardize(df.y3_gpa)
df['y4_z'] = standardize(df.y4_gpa)
```

#### Split Up a Function (Exercise)

Split up the original function, which calculates both mean and median and returns them, into two functions, each of which does one thing.

```python
def mean(values):
    """
    Return the mean of a list of values.
    
    Args:
        values (iterable of float): A list of numbers
    
    Returns:
        float
    """
    mean = sum(values) / len(values)
    return mean

def median(values):
    """
    Return the median of a sorted list of values.
    
    Args:
        values (iterable of float): A list of numbers
    
    Returns:
        float
    """
    midpoint = len(values) // 2
    if len(values) % 2 == 0:
        median = (values[midpoint - 1] + values[midpoint]) / 2
    else:
        median = values[midpoint]
    
    return median
```
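
A short usage sketch of the two functions follows. The sample data is hypothetical. The point it illustrates is that `median()` indexes by position, so it only gives the right answer when the input is already sorted, whereas `mean()` does not depend on order.

```python
def mean(values):
    """Return the mean of a list of values."""
    return sum(values) / len(values)


def median(values):
    """Return the median of a sorted list of values."""
    midpoint = len(values) // 2
    if len(values) % 2 == 0:
        return (values[midpoint - 1] + values[midpoint]) / 2
    return values[midpoint]


scores = [7, 1, 4, 10]    # hypothetical, unsorted sample data
ordered = sorted(scores)  # median() expects sorted input

print(mean(scores))     # 5.5
print(median(ordered))  # 5.5
print(median(scores))   # 2.5, which is wrong because the input was unsorted
```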

### Pass by Assignment

A list is mutable, but an integer is immutable.

See [Pass-by-value, reference, and assignment](https://mathspp.com/blog/pydonts/pass-by-value-reference-and-assignment).

In [None]:
# A mutable argument (a list) can be changed in place by the function.
def foo(x):
    x[0] = 99
my_list = [1, 2, 3]
print(my_list)
foo(my_list)
print(my_list)
print()

# An immutable argument (an int) is unaffected: `x = x + 90` only rebinds
# the local name x inside the function.
def bar(x):
    x = x + 90
my_var = 3
print(my_var)
bar(my_var)
print(my_var)
print()

# a and b refer to the same list.
a = [1, 2, 3]
print(a)
b = a
a.append(4)
print(b)
b.append(5)
print(a)

Immutable data types:
- int
- float
- bool
- string
- bytes
- tuple
- frozenset
- None

Mutable data types:
- list
- dict
- set
- bytearray
- objects
- functions
- almost everything else!
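
The following sketch shows the practical difference between mutating an argument and rebinding a parameter name. The function names `append_item` and `rebind` are made up for this illustration.

```python
def append_item(items):
    # Mutates the object that the caller passed in.
    items.append(4)


def rebind(items):
    # Rebinds the local name to a new list; the caller's list is untouched.
    items = items + [4]
    return items


numbers = [1, 2, 3]
original_id = id(numbers)

append_item(numbers)
print(numbers)                     # [1, 2, 3, 4]
print(id(numbers) == original_id)  # True: still the same object

result = rebind(numbers)
print(numbers)                     # [1, 2, 3, 4]: unchanged by rebind()
print(result is numbers)           # False: rebind() built a new list

# Immutable objects cannot be changed in place at all.
point = (1, 2)
try:
    point[0] = 99
except TypeError as error:
    print("TypeError:", error)
```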

#### Mutable Default Arguments Are Dangerous! (Demonstration)

See this example for why you shouldn't set a default to an empty list or another mutable object.

In [None]:
def foo(var=[]):
    var.append(1)
    return var
print(foo())
print(foo())
print()

# This is the correct way.
def foo2(var=None):
    if var is None:
        var = []
    var.append(1)
    return var
print(foo2())
print(foo2())
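
One way to see why the first version misbehaves is to inspect the function's `__defaults__` attribute. The default list is created once, when the function is defined, and every call that omits the argument appends to that same list. This is a small illustrative sketch that repeats the `foo` definition from above.

```python
def foo(var=[]):
    var.append(1)
    return var


print(foo.__defaults__)  # ([],)
foo()
print(foo.__defaults__)  # ([1],)
foo()
print(foo.__defaults__)  # ([1, 1],)
```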

#### Mutable or Immutable? (Exercise)

The following function adds a mapping between a string and the lowercase version of that string to a dictionary. What do you expect the values of d and s to be after the function is called?

In [None]:
def store_lower(_dict, _string):
    """
    Add a mapping between `_string` and a lowercased version of `_string` to
    `_dict`

    Args:
        _dict (dict): The dictionary to update.
        _string (str): The string to add.
    """
    orig_string = _string
    _string = _string.lower()
    _dict[orig_string] = _string

# A dictionary is a mutable object, but a string is immutable.
d = {}
s = 'Hello'

store_lower(d, s)
print(d)
print(s)

#### Best Practice for Default Arguments (Exercise)

Avoid using a mutable default argument.

In [None]:
def better_add_column(values, df=None):
    """
    Add a column of `values` to a DataFrame `df`.
    The column will be named "col_<n>", where "n" is
    the numerical index of the column.
    
    Args:
        values (iterable): The values of the new column
        df (DataFrame, optional): The DataFrame to update.
            If no DataFrame is passed, one is created by default.
    
    Returns:
        DataFrame
    """
    if df is None:
        df = pd.DataFrame()
    df["col_{}".format(len(df.columns))] = values
    return df

df = better_add_column([1, 2, 3], None)
df = better_add_column([4, 5, 6], df)
print(df.head())

## Context Managers

### Using Context Managers

#### Examples (Demonstration)

A context manager sets up a context, runs your code, and removes the context. Here, `open()` sets up a context by opening a file, lets you run any code you want on that file, and removes the context by closing the file.

```python
with open("my_file.txt") as my_file:
    text = my_file.read()
    length = len(text)
print("The file is {} characters long.".format(length))
```

Using `with` creates a compound statement, which is used as shown below:

```python
with <context-manager>(<args>):
    # Run your code here.
    # This code is running "inside the context"
# This code runs after the context is removed.
```

Some context managers return a value. Use `as` to capture that value. For example, `open()` returns a file object, which `as` binds to a name for use within the context.
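
To illustrate the set-up and tear-down described above, the following sketch compares `with open()` to a hand-written `try`/`finally`. The temporary file and the variable names are assumptions made only so the example is self-contained, and the equivalence is approximate rather than an exact description of how `with` is implemented.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello")
    path = tmp.name

# Using the context manager: the file is closed when the block ends.
with open(path) as my_file:
    text = my_file.read()
print(my_file.closed)  # True

# Roughly the same behavior written by hand with try/finally.
my_file = open(path)
try:
    text_again = my_file.read()
finally:
    my_file.close()
print(my_file.closed)  # True

# Clean up the temporary file.
os.remove(path)
```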

#### Reading a File (Exercise)

How many times does the word "cat" or "cats" appear in _Alice in Wonderland_?

In [None]:
# The context manager closes the file for you.
with open("alice.txt") as file:
    text = file.read()
n = 0
for word in text.split():
    if word.lower() in ["cat", "cats"]:
        n += 1
print('Lewis Carroll used the word "cat" {} times.'.format(n))
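
Note that splitting on whitespace leaves punctuation attached to each word, so tokens such as `cat,` are not counted by the loop above. A possible alternative, sketched here on a made-up sample string rather than on `alice.txt`, is to match whole words with a regular expression.

```python
import re

# Hypothetical sample text standing in for alice.txt.
sample = 'The Cat grinned. "Cats?" said Alice. I like my cat, Dinah. Catalog.'

# Whitespace splitting keeps punctuation attached to each word.
split_count = sum(
    1 for word in sample.split() if word.lower() in ["cat", "cats"]
)

# \b marks a word boundary, so punctuation is ignored
# but "Catalog" is still not counted.
regex_count = len(re.findall(r"\bcats?\b", sample.lower()))

print(split_count)  # 1
print(regex_count)  # 3
```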


#### Using a Timer Context Manager (Exercise)

I used the IPython shell to obtain the code for the functions used to support this example. It was amusing and instructive to do this.

    In [6]: import inspect
    In [7]: print(inspect.getsource(get_image_from_instagram))
    def get_image_from_instagram():
      return np.random.rand(84, 84)
    
    In [8]: print(inspect.getsource(process_with_numpy))
    def process_with_numpy(p):
      _process_pic(0.1521)
    
    In [9]: print(inspect.getsource(process_with_pytorch))
    def process_with_pytorch(p):
      _process_pic(0.0328)
    
    In [10]: print(inspect.getsource(_process_pic))
    def _process_pic(n_sec):
      print('Processing', end='', flush=True)
      for i in range(10):
        print('.', end='' if i < 9 else 'done!\n', flush=True)
        time.sleep(n_sec)
    
    In [11]: print(inspect.getsource(timer))
    @contextlib.contextmanager
    def timer():
      """Time how long code in the context block takes to run."""
      t0 = time.time()
      try:
        yield
      except:
        raise
      finally:
        t1 = time.time()
      print('Elapsed: {:.2f} seconds'.format(t1 - t0))

In [None]:
# This code supports the simulation.
def get_image_from_instagram():
    return np.random.rand(84, 84)

def process_with_numpy(p):
    _process_pic(0.1521)

def process_with_pytorch(p):
    _process_pic(0.0328)

def _process_pic(n_sec):
    print('Processing', end='', flush=True)
    for i in range(10):
        print('.', end='' if i < 9 else 'done!\n', flush=True)
        time.sleep(n_sec)

@contextlib.contextmanager
def timer():
    """
    Time how long code in the context block takes to run.
    """
    t0 = time.time()
    try:
        yield
    except:
        raise
    finally:
        t1 = time.time()
    print('Elapsed: {:.2f} seconds'.format(t1 - t0))

# Exercise code.
image = get_image_from_instagram()
with timer():
    print('Numpy version')
    process_with_numpy(image)
print()
with timer():
    print('Pytorch version')
    process_with_pytorch(image)

You may have noticed that the `with` statements using the `timer()` context manager have no `as <variable name>` clause. That is because `timer()` does not return a value, so there is nothing to capture. In the next lesson, you'll learn how to write your own context managers like `timer()`.
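
By contrast, a context manager that yields a value can be used with `as`. The following is a minimal sketch, not the course's code: the name `timed` and the `"elapsed"` key are hypothetical, and `time.perf_counter()` is used here instead of `time.time()`.

```python
import contextlib
import time


@contextlib.contextmanager
def timed():
    """Yield a dict that holds the elapsed time once the block finishes."""
    result = {"elapsed": None}
    t0 = time.perf_counter()
    try:
        # The yielded object is what `as` binds to in the with statement.
        yield result
    finally:
        # Runs whether or not the block raised an exception.
        result["elapsed"] = time.perf_counter() - t0


with timed() as stats:
    time.sleep(0.05)

print("Elapsed: {:.2f} seconds".format(stats["elapsed"]))
```

Because the elapsed time is stored in the yielded dictionary, the caller can use it after the block ends instead of relying on a printed message.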

### Writing Context Managers

## Decorators

## More on Decorators