# Writing Functions in Python

## Introduction

This course is presented by Shayne Miel, Director of Software Engineering at American Efficient. Collaborators are Hillary Green-Lerman and Becca Robbins.

Prerequisite:
- Python Data Science Toolbox (Part 2)

This course is part of these tracks:
- Data Engineer with Python
- Data Scientist with Python
- Python Programmer
- Python Programming

There are no datasets for this course.

## Versions

The course's IDE uses Python 3.9.7 (default, Sep 10 2021, 00:03:59) \[GCC 7.5.0].

This notebook is being written using Python 3.11.1.

## Resources

### Docstrings
- [Python PEP 257 - Docstring Conventions](https://peps.python.org/pep-0257/)
- [reStructuredText Markup](https://devguide.python.org/documentation/markup/)
- [DataCamp Docstrings Tutorial](https://www.datacamp.com/tutorial/docstrings-python)
- [Numpy Style Guide](https://numpydoc.readthedocs.io/en/latest/format.html)
- [Google Style Guide for Comments and Docstrings](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings)

### inspect Module
- [inspect - Inspect live objects](https://docs.python.org/3/library/inspect.html)

## Imports

Imports are gathered here for clarity and convenience.

In [None]:
import inspect

import numpy as np

## Best Practices

### Docstrings

#### Example Docstring (Demonstration)

```python
import numpy as np
import pandas as pd


def split_and_stack(df, new_names):
    """
    Split a DataFrame's columns into two halves and then stack
    them vertically, returning a new DataFrame with 'new_names' as the
    column names.
    
    Args:
        df (DataFrame): The DataFrame to split.
        new_names (iterable of str): The column names for the new DataFrame
    
    Returns:
        DataFrame
    """
    # Integer division gives the index of the midpoint column.
    half = len(df.columns) // 2
    left = df.iloc[:, :half]
    right = df.iloc[:, half:]
    return pd.DataFrame(
        data=np.vstack([left.values, right.values]),
        columns=new_names
    )
```

#### Anatomy of a Docstring (Demonstration)

```python
def function_name(arguments):
    """
    Description of what the function does.
    
    Description of the arguments, if any.
    
    Description of the return values, if any.
    
    Description of errors raised, if any.
    
    Optional extra notes or examples of usage.
    """
```

There are four major docstring formats:
- Google Style
- Numpydoc
- reStructuredText
- Epytext

This course focuses on Google Style and Numpydoc.

#### Google Style Docstrings (Demonstration)

This course uses Google style docstrings because the format is more
compact.

```python
def function(arg_1, arg_2=42):
    """
    Imperative description of what the function does.
    
    Args:
        arg_1 (str): Description of arg_1 that can break into the next line
            if needed.
        arg_2 (int, optional): Write optional when an argument has a default
            value
    
    Returns:
        bool: Optional description of the return value
        Extra lines are not indented
    
    Raises:
        ValueError: Include any error types that the function intentionally
            raises
    
    Notes:
        See https://www.datacamp.com/tutorial/docstrings-python
        for more information.
    """
```

#### Numpydoc Docstrings (Demonstration)

Numpydoc docstrings are the most common format in the scientific Python community.

```python
def function(arg_1, arg_2=42):
    """
    Imperative description of what the function does.
    
    Parameters
    ----------
    arg_1 : expected type of arg_1
        Description of arg_1
    arg_2 : int, optional
        Write optional when an argument has a default value.
        Default=42.
        
    Returns
    -------
    The type of the return value
        Can include a description of the return value.
        Replace "Returns" with "Yields" if this function is a generator.
    """
```
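The note about replacing "Returns" with "Yields" can be sketched with a small generator. This is only an illustration: `countdown` is a hypothetical function that does not appear in the course.

```python
import inspect


def countdown(start):
    """
    Count down from `start` to 1.

    Parameters
    ----------
    start : int
        The first value to yield.

    Yields
    ------
    int
        The next value in the countdown.
    """
    while start > 0:
        yield start
        start -= 1


# Collect the generated values into a list.
values = list(countdown(3))
print(values)

# The cleaned docstring contains the "Yields" section.
print(inspect.getdoc(countdown))
```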

#### Retrieving Docstrings (Demonstration)

In [None]:
def the_answer():
    """
    Return the answer to life,
    the universe, and everything.

    Returns:
        int
    """
    return 42

# The raw docstring keeps the indentation from the source code.
print(the_answer.__doc__)
# inspect.getdoc() strips the common leading indentation.
print(inspect.getdoc(the_answer))
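Printing the two strings with `repr()` makes the difference easier to see. The sketch below repeats `the_answer` so that it runs on its own; the variable names `raw` and `cleaned` are chosen here for illustration.

```python
import inspect


def the_answer():
    """
    Return the answer to life,
    the universe, and everything.

    Returns:
        int
    """
    return 42


# The raw attribute keeps the leading newline and the indentation.
raw = the_answer.__doc__
# inspect.getdoc() removes the blank first line and the common indentation.
cleaned = inspect.getdoc(the_answer)

print(repr(raw))
print(repr(cleaned))
```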

#### Crafting a Docstring (Exercise)

In [None]:
def count_letter(content, letter):
    """
    Count the number of times `letter` appears in `content`.

    Args:
        content (str): The string to search.
        letter (str): The letter to search for.
    
    Returns:
        int
    
    Raises:
        ValueError: If `letter` is not a one-character string.
    """
    if (not isinstance(letter, str)) or len(letter) != 1:
        raise ValueError('`letter` must be a single character string.')
    return len([char for char in content if char == letter])
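A short usage sketch shows both the normal result and the documented `ValueError`. The function is repeated (with a shortened docstring) so the example is self-contained, and the input strings are made up for illustration.

```python
def count_letter(content, letter):
    """
    Count the number of times `letter` appears in `content`.

    Raises:
        ValueError: If `letter` is not a one-character string.
    """
    if (not isinstance(letter, str)) or len(letter) != 1:
        raise ValueError('`letter` must be a single character string.')
    return len([char for char in content if char == letter])


# Normal use: count a single character.
count = count_letter("banana", "a")
print(count)

# Passing more than one character triggers the documented error.
try:
    count_letter("banana", "an")
    raised = False
except ValueError as error:
    raised = True
    print(error)
```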

#### Retrieving Docstrings (Exercise)

In [None]:
# Display the unprocessed docstring.
docstring = count_letter.__doc__
border = '#' * 28
print('{}\n{}\n{}'.format(border, docstring, border))

# Use inspect.getdoc to remove leading and trailing blank lines and leading
# white space from the docstring.
docstring = inspect.getdoc(count_letter)
border = '#' * 28
print('{}\n{}\n{}'.format(border, docstring, border))

def build_tooltip(function):
    """
    Create a tooltip for any function that shows the
    function's docstring.
    
    Args:
        function (callable): The function we want a tooltip for.
    
    Returns:
        str
    """
    # Get the docstring for the function argument by using inspect.
    docstring = inspect.getdoc(function)
    border = "#" * 28
    return "{}\n{}\n{}".format(border, docstring, border)

print(build_tooltip(count_letter))
print(build_tooltip(range))
print(build_tooltip(print))
print(build_tooltip(build_tooltip))
print(build_tooltip(inspect.getdoc))
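`inspect.getdoc` returns `None` when a function has no docstring, so `build_tooltip` would print the text `None` between the borders. The sketch below is one possible way to handle that case. `build_tooltip_safe`, its `placeholder` parameter, and `undocumented` are hypothetical names that are not part of the course.

```python
import inspect


def build_tooltip_safe(function, placeholder="(no docstring)"):
    """
    Create a tooltip that falls back to a placeholder.

    Args:
        function (callable): The function we want a tooltip for.
        placeholder (str, optional): Text shown when there is no docstring.

    Returns:
        str
    """
    # Fall back to the placeholder when getdoc() returns None.
    docstring = inspect.getdoc(function) or placeholder
    border = "#" * 28
    return "{}\n{}\n{}".format(border, docstring, border)


def undocumented():
    return None


tooltip = build_tooltip_safe(undocumented)
print(tooltip)
```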

#### Docstrings to the Rescue! (Exercise)

In [None]:
# This was an exercise in looking at docstrings.
print(np.histogram.__doc__)

### DRY and "Do One Thing"

#### Don't Repeat Yourself (Demonstration)

This code repeats itself, and it contains an error in the third block (marked `### yikes! ###`): the PCA for the test data is fit on `train_X` instead of `test_X`. This code is difficult to maintain because any change in the algorithm must be made at three separate locations.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

# Analyze the training data.
train = pd.read_csv("train.csv")
train_y = train["labels"].values
train_X = train[[col for col in train.columns if col != "labels"]].values
train_pca = PCA(n_components=2).fit_transform(train_X)
plt.scatter(train_pca[:, 0], train_pca[:, 1])

# Analyze the validation data.
val = pd.read_csv("validation.csv")
val_y = val["labels"].values
val_X = val[[col for col in val.columns if col != "labels"]].values
val_pca = PCA(n_components=2).fit_transform(val_X)
plt.scatter(val_pca[:, 0], val_pca[:, 1])

# Analyze the test data.
test = pd.read_csv("test.csv")
test_y = test["labels"].values
test_X = test[[col for col in test.columns if col != "labels"]].values
test_pca = PCA(n_components=2).fit_transform(train_X) ### yikes! ###
plt.scatter(test_pca[:, 0], test_pca[:, 1])
```

A function eliminates the repeated code. It does the desired work once and
returns the features and labels for each dataset for further use. (Note that
we provide a well-formatted docstring for the function.)

```python
def load_and_plot(path):
    """
    Load a dataset and plot the first two principal components.
    
    Args:
        path (str): The location of the CSV file.
    
    Returns:
        tuple of ndarray: (features, labels)
    """
    data = pd.read_csv(path)
    data_y = data["labels"].values
    data_X = data[[col for col in data.columns if col != "labels"]].values
    data_pca = PCA(n_components=2).fit_transform(data_X)
    plt.scatter(data_pca[:, 0], data_pca[:, 1])
    return data_X, data_y

train_X, train_y = load_and_plot("train.csv")
val_X, val_y = load_and_plot("validation.csv")
test_X, test_y = load_and_plot("test.csv")
```

At this point, this function violates another software engineering principle:
it does not do one thing. It does two or three things (loading, transforming,
and plotting), depending on how you count them.

Create two functions that decouple data loading from data transformation and plotting.

```python
def load_data(path):
    """
    Load a dataset and return the x and y values.
    
    Args:
        path (str): The location of the CSV file.
    
    Returns:
        tuple of ndarray: (features, labels)
    """
    data = pd.read_csv(path)
    data_y = data["labels"].values
    data_X = data[[col for col in data.columns if col != "labels"]].values
    return data_X, data_y

def plot_data(data_X):
    """
    Plot the first two principal components of a matrix.
    
    Args:
        data_X (numpy.ndarray): The data to plot.
    """
    data_pca = PCA(n_components=2).fit_transform(data_X)
    plt.scatter(data_pca[:, 0], data_pca[:, 1])
```

When each function does one thing, the code becomes:
- more flexible
- more easily understood
- simpler to test
- simpler to debug
- easier to change
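The "simpler to test" point can be sketched with two single-purpose functions that are checked separately. This is a standard-library stand-in, not the pandas and scikit-learn code above: `load_rows`, `column_means`, and the in-memory CSV text are hypothetical and exist only for illustration.

```python
import csv
import io


def load_rows(handle):
    """
    Load CSV rows and split the features from the labels.

    Args:
        handle (file-like): An open CSV source with a "labels" column.

    Returns:
        tuple of list: (features, labels)
    """
    features, labels = [], []
    for row in csv.DictReader(handle):
        labels.append(row["labels"])
        features.append(
            [float(value) for key, value in row.items() if key != "labels"]
        )
    return features, labels


def column_means(features):
    """
    Compute the mean of each feature column.

    Args:
        features (list of list of float): The rows to summarize.

    Returns:
        list of float
    """
    return [sum(column) / len(column) for column in zip(*features)]


# Loading can be checked with a small in-memory file.
sample = io.StringIO("a,b,labels\n1,2,x\n3,4,y\n")
features, labels = load_rows(sample)
print(features, labels)

# The transformation can be checked without touching any file.
means = column_means([[1.0, 2.0], [3.0, 4.0]])
print(means)
```

Because neither function depends on the other, a failure in one check points directly at the function responsible.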

Shayne Miel recommends reading _Refactoring: Improving the Design of Existing Code (2nd Edition)_ by Martin Fowler.

By the way, reading this code inspired me to look into how to do principal component analysis with scikit-learn's `PCA` class. See [Principal Component Analysis of Breast Cancer Dataset](../Principal%20Component%20Analysis%20in%20Python/Principal%20Component%20Analysis%20of%20Breast%20Cancer%20Dataset.ipynb).

## Context Managers

## Decorators

## More on Decorators