# Jupyter Notebooks and Python Guide: Best Practices
## How to write Python code for projects

Here are some basic guidelines for writing Python code in Jupyter notebooks.

- Use Python 3. Python 2 reached end of life in January 2020, so there is no reason to use it for new code.
- Keep variable names short but descriptive. For example, `this_is_a_bad_variable_name` is too long, while `good_vname` is concise.
- Use underscores to separate words in variable names (for example, `clean_df`). It's more readable.
- Comment your code generously. Comments help every reader, including you: if you come back to the code after 6 months, you will most likely not remember what it does, and the comments will be very useful. Typically you should be writing more comments than code.
- When using a Jupyter notebook, do not write 50 lines of code in a single cell. Break the code down into related chunks and put each chunk in a separate cell. For example:

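    ```python
    # CELL 1: cleaning
    import pandas as pd
    import matplotlib.pyplot as plt

    # Placeholder data standing in for an unclean data frame
    unclean_df = pd.DataFrame({"column1": [1, 2, None, 4], "column2": [2, 4, 6, None]})
    clean_df = unclean_df.dropna()               # an intermediate cleaning step
    clean_df2 = clean_df.reset_index(drop=True)  # a cleaning operation applied to clean_df

    # CELL 2: plotting
    plt.figure()
    plt.plot(clean_df2['column1'], clean_df2['column2'])
    plt.show()
    ```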

    You can perform the cleaning operations in the first cell and the plotting in the second cell. You don't need to do everything in one cell.
- The one exception to this rule is variable reuse: reassigning a variable name in a separate cell can cause errors. For example:

    ```python
    # CELL 1
    import pandas as pd
    import matplotlib.pyplot as plt

    unclean_df = pd.read_csv("data source")  # placeholder path to an unclean data frame
    clean_df = unclean_df.dropna()           # a cleaning operation applied to unclean_df

    # CELL 2
    clean_df = clean_df.iloc[::2]  # Suppose you take a subset of the data (every other row)
    plt.figure()
    plt.plot(clean_df['column1'], clean_df['column2'])
    plt.show()
    ```
    In the above example, we run into a problem because `clean_df` is overwritten in the second cell. Its value now depends on which cells you have run and in what order. To avoid this, either use a unique variable name each time or, if you must reuse a name, do so in the same cell where the variable was first defined.
    
- It's good practice to package code into functions; this way you can ensure that you are isolating steps. It's much harder to troubleshoot one large chunk of code than a set of small functions.
- When you write functions, make sure you include a docstring, and please use the NumPy/SciPy docstring format. This is one of the most readable docstring formats out there.
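
The last few guidelines can be sketched together in a small example. This is only an illustration under stated assumptions: `readings` is a plain list standing in for a data frame, and `take_subset` is a hypothetical helper, not part of any library. It contrasts reusing a name across re-runs with calling a documented function that leaves its input untouched.

```python
# Hypothetical stand-in for a data frame: a plain list of readings.
readings = [4, 8, 15, 16, 23, 42]


def take_subset(values, n):
    """
    Return the first `n` items of a sequence.

    Parameters
    ----------
    values : list
        Input sequence. It is not modified.
    n : int
        Number of leading items to keep.

    Returns
    -------
    subset : list
        A new list holding the first `n` items of `values`.
    """
    return values[:n]


# Risky pattern: the same name is reassigned, so every re-run of the
# "cell" drops one more item from the data.
reused = readings
reused = reused[1:]  # first run of the cell
reused = reused[1:]  # accidental second run of the same cell

# Safer pattern: a unique name computed from the untouched input.
# Re-running these lines always gives the same result.
first_three = take_subset(readings, 3)
first_three_again = take_subset(readings, 3)

print(reused)
print(first_three, first_three_again)
```

Because `take_subset` returns a new list and never changes `readings`, the cell that calls it can be run any number of times without changing the outcome.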



## Docstring guide

We will follow the NumPy/SciPy docstring guide, which can be found at https://numpydoc.readthedocs.io/en/latest/format.html#method-docstrings

Typically, a docstring looks like this:

```python
def function(var1, var2, var3):
    """
    A summary of what the function does.

    Parameters
    ----------
    var1 : type
        Description of variable.
    var2 : type
        Description of variable.
    var3 : type
        Description of variable.

    Returns
    -------
    r1 : type
        Description of return variable.
    r2 : type
        Description of return variable.

    Notes
    -----
    Add this section if you think it's needed. Put any extra information here.
    """
    # Code here; r1 and r2 are placeholders for the values you compute.
    r1, r2 = None, None

    return r1, r2
```

You can copy the above docstring and use it as a template in the functions you write. 
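
As a sketch of what the template might look like once filled in, here is a small function with two return values. The function `min_max` and its parameter names are made up for illustration; only the section layout comes from the numpydoc format.

```python
def min_max(values):
    """
    Find the smallest and largest value in a sequence.

    Parameters
    ----------
    values : list of float
        Non-empty sequence of numbers to search.

    Returns
    -------
    lowest : float
        The smallest value in `values`.
    highest : float
        The largest value in `values`.

    Notes
    -----
    Raises ``ValueError`` if `values` is empty, because ``min`` and ``max``
    are undefined for an empty sequence.
    """
    return min(values), max(values)


lowest, highest = min_max([3.0, 1.5, 4.2])
print(lowest, highest)
```

In a notebook, running `help(min_max)` or `min_max?` displays this docstring, which is where a consistent format pays off.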

Below is an example of how the function `np.mean` is documented in the NumPy source code. You can see how each component is documented accordingly.



```python
import numpy as np
def mean(a, axis=None, dtype=None, out=None, keepdims=np._NoValue, *,
         where=np._NoValue):
    """
    Compute the arithmetic mean along the specified axis.

    Returns the average of the array elements.  The average is taken over
    the flattened array by default, otherwise over the specified axis.
    `float64` intermediate and return values are used for integer inputs.

    Parameters
    ----------
    a : array_like
        Array containing numbers whose mean is desired. If `a` is not an
        array, a conversion is attempted.
    axis : None or int or tuple of ints, optional
        Axis or axes along which the means are computed. The default is to
        compute the mean of the flattened array.

        .. versionadded:: 1.7.0

        If this is a tuple of ints, a mean is performed over multiple axes,
        instead of a single axis or all the axes as before.
    dtype : data-type, optional
        Type to use in computing the mean.  For integer inputs, the default
        is `float64`; for floating point inputs, it is the same as the
        input dtype.
    out : ndarray, optional
        Alternate output array in which to place the result.  The default
        is ``None``; if provided, it must have the same shape as the
        expected output, but the type will be cast if necessary.
        See :ref:`ufuncs-output-type` for more details.
    keepdims : bool, optional
        If this is set to True, the axes which are reduced are left
        in the result as dimensions with size one. With this option,
        the result will broadcast correctly against the input array.

        If the default value is passed, then `keepdims` will not be
        passed through to the `mean` method of sub-classes of
        `ndarray`, however any non-default value will be.  If the
        sub-class' method does not implement `keepdims` any
        exceptions will be raised.
    where : array_like of bool, optional
        Elements to include in the mean. See `~numpy.ufunc.reduce` for details.

        .. versionadded:: 1.20.0

    Returns
    -------
    m : ndarray, see dtype parameter above
        If `out=None`, returns a new array containing the mean values,
        otherwise a reference to the output array is returned.

    See Also
    --------
    average : Weighted average
    std, var, nanmean, nanstd, nanvar

    Notes
    -----
    The arithmetic mean is the sum of the elements along the axis divided
    by the number of elements.

    Note that for floating-point input, the mean is computed using the
    same precision the input has.  Depending on the input data, this can
    cause the results to be inaccurate, especially for `float32` (see
    example below).  Specifying a higher-precision accumulator using the
    `dtype` keyword can alleviate this issue.

    By default, `float16` results are computed using `float32` intermediates
    for extra precision.

    Examples
    --------
    >>> a = np.array([[1, 2], [3, 4]])
    >>> np.mean(a)
    2.5
    >>> np.mean(a, axis=0)
    array([2., 3.])
    >>> np.mean(a, axis=1)
    array([1.5, 3.5])

    In single precision, `mean` can be inaccurate:

    >>> a = np.zeros((2, 512*512), dtype=np.float32)
    >>> a[0, :] = 1.0
    >>> a[1, :] = 0.1
    >>> np.mean(a)
    0.54999924

    Computing the mean in float64 is more accurate:

    >>> np.mean(a, dtype=np.float64)
    0.55000000074505806 # may vary

    Specifying a where argument:

    >>> a = np.array([[5, 9, 13], [14, 10, 12], [11, 15, 19]])
    >>> np.mean(a)
    12.0
    >>> np.mean(a, where=[[True], [False], [False]])
    9.0
    """
    # Note: `mu` and `_methods` are NumPy-internal modules imported at the top
    # of NumPy's own source file, so this excerpt illustrates the docstring
    # and is not meant to be called on its own.
    kwargs = {}
    if keepdims is not np._NoValue:
        kwargs['keepdims'] = keepdims
    if where is not np._NoValue:
        kwargs['where'] = where
    if type(a) is not mu.ndarray:
        try:
            mean = a.mean
        except AttributeError:
            pass
        else:
            return mean(axis=axis, dtype=dtype, out=out, **kwargs)

    return _methods._mean(a, axis=axis, dtype=dtype,
                          out=out, **kwargs)
```
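
The `Examples` section of a docstring like the one above uses interpreter-style `>>>` lines followed by the expected output. One possible way to keep such examples honest is to run them with Python's standard-library `doctest` module. The sketch below assumes a hypothetical `midpoint` function written for this illustration; it is not taken from NumPy.

```python
import doctest


def midpoint(a, b):
    """
    Return the value halfway between two numbers.

    Parameters
    ----------
    a : float
        First number.
    b : float
        Second number.

    Returns
    -------
    mid : float
        The arithmetic mean of `a` and `b`.

    Examples
    --------
    >>> midpoint(1, 3)
    2.0
    >>> midpoint(-1.0, 1.0)
    0.0
    """
    return (a + b) / 2


# Collect the examples embedded in the docstring and execute them.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(midpoint, "midpoint", globs={"midpoint": midpoint}):
    runner.run(test)

# `results` reports how many examples were attempted and how many failed.
results = runner.summarize(verbose=False)
print(results)
```

If an example's printed output ever stops matching what the function returns, the failure count becomes non-zero, which flags the docstring as out of date.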