# Data science in Python

- Course GitHub repo: https://github.com/pycam/python-data-science
- Python website: https://www.python.org/ 

## Session 1.3: Creating functions and modules to write reusable code

- [Building reusable and modular code with functions](#Building-reusable-and-modular-code-with-functions)
    - [Exercise 1.3.1](#Exercise-1.3.1)
    - [Exercise 1.3.2](#Exercise-1.3.2)
- [Create your own module](#Create-your-own-module)
    - [Exercise 1.3.3](#Exercise-1.3.3)

## Mind map

<img src="img/mind_maps/mind_maps.003.jpeg">

## Building reusable and modular code with functions

So far, we’ve used Python to explore and manipulate individual datasets by hand, much like we would do in a spreadsheet. The beauty of using a programming language like Python, though, comes from the ability to automate data processing through the use of loops and functions.

Suppose now that we would like to calculate the average GDP per capita, its median and its standard deviation for all continents over all the years. We could write specific conditions for each case and repeat the same code for each situation, but that would be time-consuming, error-prone and hard to maintain. A more elegant solution would be to create a **reusable tool** that performs this task with minimum input from the user. To do this, we are going to turn the code we’ve already written into a **function**.

Functions are reusable, self-contained pieces of code that are called with a single command. They can be designed to accept arguments as input and return values, but they don’t need to do either. Variables declared inside functions only exist while the function is running and if a variable within the function (a local variable) has the same name as a variable somewhere else in the code, the local variable hides but doesn’t overwrite the other.
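As a minimal sketch of the points above (the names `say_hello` and `greeting` are made up for illustration), the function below takes no arguments and has no `return` statement, so calling it gives back `None`. Its local `greeting` hides, but does not overwrite, the variable of the same name defined outside the function.

```python
greeting = 'hello from outside the function'

def say_hello():
    # This local variable hides the outer 'greeting' while the function runs
    greeting = 'hello from inside the function'
    print(greeting)

# The function has no return statement, so the call evaluates to None
result = say_hello()
print(result)

# The outer variable is unchanged
print(greeting)
```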

Built-in commands we have already used, such as **`print()`**, are functions, and the libraries we import (say, `csv` or `os`) are largely collections of functions.

We will first use functions that are housed within the same code that uses them, and then create our own module to write functions that can be used by different programs.

### Function definition

Functions are declared following this general structure:

In [None]:
def this_is_the_function_name(input_argument1, input_argument2):

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')

    # And returns their product
    return input_argument1 * input_argument2

The function declaration starts with the word **`def`**, followed by the function name and any arguments in parentheses, and ends with a colon. The body of the function is indented just like loops are. If the function returns something when it is called, it includes a **`return`** statement, usually at the end.

Once the `return` statement is reached the operation of the function ends, and anything on the return line is passed back as output.

### Function call

This is how we call the function:

In [None]:
product_of_inputs = this_is_the_function_name(2, 5)

In [None]:
print('Their product is:', product_of_inputs, '(this is done outside the function!)')

### Function arguments
If we change the values of the arguments when calling the function, then its output changes:

In [None]:
product_of_inputs = this_is_the_function_name(4, 7)
print('Their product is:', product_of_inputs, '(this is done outside the function!)')

If we call the function by giving it the wrong number of arguments (not 2), we get a `TypeError`:

In [None]:
product_of_inputs = this_is_the_function_name(4)

The arguments we have passed to the function so far have all been **mandatory**: if we do not supply them, or if we supply the wrong number of arguments, Python will throw an error.

**Mandatory arguments are assumed to come in the same order as the arguments in the function definition**, but you can also opt to specify the arguments using the argument names as _keywords_, supplying the values corresponding to each keyword with a `=` sign.

In [None]:
product_of_inputs = this_is_the_function_name(input_argument1=3, input_argument2=2)

In [None]:
product_of_inputs = this_is_the_function_name(input_argument2=3, input_argument1=2)

**BEWARE!** Unnamed (positional) arguments must come before named (keyword) arguments, otherwise we will get a `SyntaxError`:

In [None]:
product_of_inputs = this_is_the_function_name(3, input_argument2=2)

In [None]:
product_of_inputs = this_is_the_function_name(input_argument2=2, 3)

### Function returned values

If we call the function without assigning the result to a variable (`product_of_inputs =`), the code within the function is still executed, but the value passed back via the `return` statement is not stored anywhere and cannot be reused later in the code:

In [None]:
this_is_the_function_name(4, 7)

The function written so far has returned only a single value, but it is possible to pass back more than one value via the `return` statement. In the following example, we write a function that takes two arguments and passes back three values: the total, the difference and the product of these two arguments. The return values are really passed back inside a single tuple, which can be caught as a single collection of values or unpacked into separate variables.

In [None]:
def this_is_the_function_name_returning_multiple_values(input_argument1, input_argument2):
    
    total = input_argument1 + input_argument2
    difference = input_argument1 - input_argument2
    product = input_argument1 * input_argument2
    
    return total, difference, product

In [None]:
returned_collection = this_is_the_function_name_returning_multiple_values(2, 4)

In [None]:
print(returned_collection)

In [None]:
total_of_inputs, difference_of_inputs, product_of_inputs = this_is_the_function_name_returning_multiple_values(2, 4)

In [None]:
print(total_of_inputs, difference_of_inputs, product_of_inputs)

There can be more than one `return` statement in a function, although typically there is only one, at the end of the function. The `return` keyword immediately exits the function, and no more of the code in that function will be run once the function has returned.

In [None]:
def this_is_the_function_name(input_argument1, input_argument2):

    # The body of the function is indented
    
    # This is a variable inside the function
    variable_inside_function = '(this is done inside the function!)'

    # And returns their product
    return input_argument1 * input_argument2

    # This function does not print the two arguments to screen (no code executed after return statement)
    print('The function arguments are:', input_argument1, input_argument2, variable_inside_function)

In [None]:
product_of_inputs = this_is_the_function_name(4, 7)
print('Their product is:', product_of_inputs, '(this is done outside the function!)')
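To illustrate a function with more than one `return` statement, here is a small sketch using a hypothetical function `safe_ratio` (not part of the course code). The first `return` exits early when the division is not possible; the second is only reached otherwise.

```python
def safe_ratio(numerator, denominator):
    # Leave the function early when the division is not possible
    if denominator == 0:
        return None

    # This line is only reached when the denominator is not zero
    return numerator / denominator

print(safe_ratio(10, 4))
print(safe_ratio(10, 0))
```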

### Function variable scope

If we declare a variable inside the function, it is a local variable that is only visible within the function. We are therefore unable to access it outside the function:

In [None]:
def this_is_the_function_name(input_argument1, input_argument2):

    # The body of the function is indented
    
    # This is a variable inside the function
    variable_inside_function = '(this is done inside the function!)'

    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, variable_inside_function)
    
    # And returns their product
    return input_argument1 * input_argument2

In [None]:
product_of_inputs = this_is_the_function_name(5, 2)

In [None]:
print(variable_inside_function)

In [None]:
print(product_of_inputs)

When a variable with the same name is declared both inside and outside the function, the function only sees its own local variable, and the rest of the code only sees the outside (global) one. Assigning a new value to the name inside the function therefore does not change the variable outside:

In [None]:
variable_inside_and_outside_function = 'this is a variable created outside the function'

def this_is_the_function_name(input_argument1, input_argument2):

    # The body of the function is indented
    
    # This is a variable inside the function
    variable_inside_function = '(this is done inside the function!)'
    
    # This is a variable created outside and modified inside the function
    variable_inside_and_outside_function = 'this is a variable changed inside the function'
    print(variable_inside_and_outside_function)

    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, variable_inside_function)
    
    # And returns their product
    return input_argument1 * input_argument2

**BEWARE!** When using Jupyter Notebooks and modifying a function, you MUST re-run that cell in order for the changed function to be available to the rest of the code. Nothing will visibly happen when you do this, though, because simply defining a function without calling it doesn’t produce an output. Any cells that use the now-changed functions will also have to be re-run for their output to change.

In [None]:
product_of_inputs = this_is_the_function_name(10, 3)

In [None]:
print(variable_inside_and_outside_function)

### Function documentation

The text between the two sets of triple double quotes is called a **docstring** and contains the documentation for the function. It does nothing when the function is running and is therefore not necessary, but it is good practice to include docstrings as a reminder of what the code does. Docstrings in functions also become part of their ‘official’ documentation:

In [None]:
def this_is_the_function_name(input_argument1, input_argument2):
    """
    This is the documentation of the function.
    Returns the product of the two arguments.
    
    input_argument1 --- first input argument 
    input_argument2 --- second input argument
    """

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')

    # And returns their product
    return input_argument1 * input_argument2

In [None]:
help(this_is_the_function_name)
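Besides `help()`, the docstring can also be read directly from the function object. The sketch below uses a hypothetical function `product_of_two` to show the `__doc__` attribute and `inspect.getdoc()` from the standard library, which returns the same text with the indentation cleaned up.

```python
import inspect

def product_of_two(first_value, second_value):
    """
    Returns the product of the two arguments.

    first_value --- first input argument
    second_value --- second input argument
    """
    return first_value * second_value

# The docstring as stored on the function object
print(product_of_two.__doc__)

# The same text with leading blank lines and common indentation removed
print(inspect.getdoc(product_of_two))
```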

## Exercise 1.3.1

- Write a function that takes two arguments and returns their mean. 
    - Give your function a meaningful name and good documentation.
    - Call your function multiple times with different values, and once using the keyword arguments with their associated values.
    - Print the result of these different function calls.
- Write another function that takes a list as argument and returns the mean and the median of all the numbers in the list.

### Writing our own function

We can now turn our code for calculating the average GDP per capita, its median and standard deviation for all continents over all the years into a function. 

Here is the original code we wrote:

In [None]:
import os
import statistics as stats
import csv
eu_gdppercap_1962 = []
americas_gdppercap_1962 = []
with open(os.path.join('data', 'gapminder.csv')) as f:
    reader = csv.DictReader(f, delimiter = ",")
    for data in reader:        
        if data['year'] == "1962":
            if data['continent'] == "Europe":
                eu_gdppercap_1962.append(float(data['gdpPercap']))
            if data['continent'] == 'Americas':
                americas_gdppercap_1962.append(float(data['gdpPercap']))
                

print('European GDP per Capita in 1962')
print(eu_gdppercap_1962)
print('average:', stats.mean(eu_gdppercap_1962))
print('median:', stats.median(eu_gdppercap_1962))
print('standard deviation:', stats.stdev(eu_gdppercap_1962))

print('American GDP per Capita in 1962')
print(americas_gdppercap_1962)
print('average:', stats.mean(americas_gdppercap_1962))
print('median:', stats.median(americas_gdppercap_1962))
print('standard deviation:', stats.stdev(americas_gdppercap_1962))

Let’s first write a function that filters data for a continent and a specific year, and calculates the average, median and standard deviation of the GDP of the countries of this continent:

In [None]:
import statistics as stats
import csv

def gdp_stats_by_continent_and_year(gapminder_filepath, continent, year):
    """
    Returns a dictionary of the average, median and standard deviation of GDP per capita 
    for all countries of the selected continent for a given year.

    gapminder_filepath --- gapminder file path with multi-continent and multi-year data
    continent --- continent for which data is extracted
    year --- year for which data is extracted
    """
    gdppercap = []
    with open(gapminder_filepath) as f:
        reader = csv.DictReader(f, delimiter = ",")
        for data in reader: 
            if data['continent'] == continent and data['year'] == year:
                gdppercap.append(float(data['gdpPercap']))
    print(continent, 'GDP per Capita in', year)
    return {'mean': stats.mean(gdppercap), 'median': stats.median(gdppercap), 'stdev': stats.stdev(gdppercap)}

In [None]:
help(gdp_stats_by_continent_and_year)

In [None]:
import os       
gdp_stats = gdp_stats_by_continent_and_year(os.path.join('data', 'gapminder.csv'), 'Europe', '1962')
print(gdp_stats)

In [None]:
import os       
gdp_stats = gdp_stats_by_continent_and_year(os.path.join('data', 'gapminder.csv'), 'Europe', '2007')
print(gdp_stats['mean'])

In [None]:
import os       
gdp_stats = gdp_stats_by_continent_and_year(os.path.join('data', 'gapminder.csv'), 'Americas', '1962')
print(gdp_stats)

In [None]:
import os       
gdp_stats = gdp_stats_by_continent_and_year(os.path.join('data', 'gapminder.csv'), 'Africa', '1962')
print(gdp_stats)
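Rather than writing one cell per continent and year, the function can be called inside loops. The sketch below is self-contained so it can run on its own: it repeats the function (without the `print`), and writes a tiny CSV file with made-up values to a temporary directory. These toy rows are NOT real Gapminder data; with the course data you would pass `os.path.join('data', 'gapminder.csv')` and the real continent names and years instead.

```python
import csv
import os
import statistics as stats
import tempfile

def gdp_stats_by_continent_and_year(gapminder_filepath, continent, year):
    """Same logic as the function above, repeated so this sketch runs on its own."""
    gdppercap = []
    with open(gapminder_filepath) as f:
        for data in csv.DictReader(f):
            if data['continent'] == continent and data['year'] == year:
                gdppercap.append(float(data['gdpPercap']))
    return {'mean': stats.mean(gdppercap),
            'median': stats.median(gdppercap),
            'stdev': stats.stdev(gdppercap)}

# Made-up rows for illustration only: these are NOT real Gapminder values
toy_rows = [
    'country,continent,year,gdpPercap',
    'A,Europe,1952,100.0',
    'B,Europe,1952,300.0',
    'C,Asia,1952,50.0',
    'D,Asia,1952,150.0',
    'A,Europe,2007,1000.0',
    'B,Europe,2007,3000.0',
    'C,Asia,2007,500.0',
    'D,Asia,2007,1500.0',
]
toy_filepath = os.path.join(tempfile.mkdtemp(), 'toy_gapminder.csv')
with open(toy_filepath, 'w') as f:
    f.write('\n'.join(toy_rows) + '\n')

# Call the function once per (continent, year) pair and keep every result
all_stats = {}
for continent in ['Europe', 'Asia']:
    for year in ['1952', '2007']:
        all_stats[(continent, year)] = gdp_stats_by_continent_and_year(toy_filepath, continent, year)

for key, value in all_stats.items():
    print(key, value)
```

Storing the results in a dictionary keyed by `(continent, year)` is just one possible design; a list of dictionaries would work equally well.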

### Function arguments with default values

The functions we wrote demand that we give them a value for every argument. Ideally, we would like these functions to be as flexible and independent as possible. 

Let’s modify the function `gdp_stats_by_continent_and_year` so that the `continent` and `year` default to `Europe` and `1952` if they are not supplied by the user. We can do this by assigning some value to the named argument with the `=` operator in the function definition.

Any argument in the function without a default value (here, `gapminder_filepath`) is a required argument and MUST come before the arguments with default values (which are optional in the function call).

In [None]:
import statistics as stats
import csv

def gdp_stats_by_continent_and_year(gapminder_filepath, continent='Europe', year='1952'):
    """
    Returns a dictionary of the average, median and standard deviation of GDP per capita 
    for all countries of the selected continent for a given year.

    gapminder_filepath --- gapminder file path with multi-continent and multi-year data
    continent --- continent for which data is extracted
    year --- year for which data is extracted
    """
    gdppercap = []
    with open(gapminder_filepath) as f:
        reader = csv.DictReader(f, delimiter = ",")
        for data in reader: 
            if data['continent'] == continent and data['year'] == year:
                gdppercap.append(float(data['gdpPercap']))
    print(continent, 'GDP per Capita in', year)
    return {'mean': stats.mean(gdppercap), 'median': stats.median(gdppercap), 'stdev': stats.stdev(gdppercap)}

In [None]:
import os       
gdp_stats = gdp_stats_by_continent_and_year(os.path.join('data', 'gapminder.csv'))
print(gdp_stats)
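Optional arguments can be left out, overridden by position, or overridden by keyword. The following sketch uses a hypothetical function `raise_to_power` (chosen only because it needs no data file) to show the three cases.

```python
def raise_to_power(base, exponent=2, verbose=False):
    # 'base' is required; 'exponent' and 'verbose' are optional
    result = base ** exponent
    if verbose:
        print(base, 'to the power of', exponent, 'is', result)
    return result

# Uses both default values
print(raise_to_power(3))

# Overrides 'exponent' by position
print(raise_to_power(3, 3))

# Keeps the default 'exponent' and overrides 'verbose' by keyword
print(raise_to_power(3, verbose=True))
```

In the same way, `gdp_stats_by_continent_and_year` could be called with only `year='2007'` as a keyword argument to keep the default continent.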

## Exercise 1.3.2

- Generalise the code written for exercise 1.1.3 for finding which European countries have the largest population in 1952 and 2007 by creating a function that finds which country on a defined continent has the largest population for a given year. Provide default values for certain arguments.

## Create your own module

So far we have been writing Python code in files as executable scripts without knowing that they are also modules from which we are able to call the different functions defined in them.

A module is a file containing Python definitions and statements. The file name is the module name with the suffix `.py` appended. Create a file called `this_is_the_module_name.py` in the current directory with the function `this_is_the_function_name()` written earlier as its contents:

In [None]:
def this_is_the_function_name(input_argument1, input_argument2):
    """
    This is the documentation of the function.
    Returns the product of the two arguments.
    
    input_argument1 --- first input argument 
    input_argument2 --- second input argument
    """

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')

    # And returns their product
    return input_argument1 * input_argument2

Now open a terminal window, start the Python interpreter from the directory in which you created the `this_is_the_module_name.py` file, and import the module:

```
python3
Python 3.6.4 (default, Jan 21 2018, 20:11:12) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import this_is_the_module_name
>>> product_of_inputs = this_is_the_module_name.this_is_the_function_name(10, 3)
The function arguments are: 10 3 (this is done inside the function!)
>>> print(product_of_inputs)
30
>>>
```

To import the module into this notebook, run the cell below. If you later edit the module file, to change the code or add another function, the simplest way to have these changes taken into account is to restart the kernel using the restart button in the menu bar, and then re-run the import.

In [None]:
import this_is_the_module_name
product_of_inputs = this_is_the_module_name.this_is_the_function_name(10, 3)
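One alternative to restarting the kernel that may be worth knowing is `importlib.reload()` from the standard library, which re-executes a module that has already been imported. The sketch below is self-contained: it writes a throwaway module to a temporary directory (the name `hypothetical_demo_module` and its `greeting` function are made up for illustration), imports it, edits the file, and reloads it. In this notebook, the equivalent would be calling `importlib.reload(this_is_the_module_name)` after editing the file.

```python
import importlib
import os
import sys
import tempfile

# Write a first version of a throwaway module into a temporary directory
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, 'hypothetical_demo_module.py')
with open(module_path, 'w') as f:
    f.write('def greeting():\n    return "first version"\n')

# Make the temporary directory importable, then import the module
sys.path.insert(0, module_dir)
import hypothetical_demo_module
first = hypothetical_demo_module.greeting()

# Simulate editing the file: the function now returns a different text
with open(module_path, 'w') as f:
    f.write('def greeting():\n    return "second version, after editing the file"\n')

# A plain 'import' would do nothing here because the module is already loaded;
# reload() runs the file again and updates the module in place
importlib.reload(hypothetical_demo_module)
second = hypothetical_demo_module.greeting()

print(first)
print(second)
```

Names imported with `from module import name` keep pointing at the old objects after a reload, so restarting the kernel remains the safest option when in doubt.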

A module can contain executable statements as well as function definitions. These statements are intended to initialize the module. They are executed only the first time the module name is encountered in an import statement. 
They are also run if the file is executed as a script.

Do comment out these executable statements if you do not wish to have them executed when importing your module.
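A common alternative to commenting such statements out is to place them under an `if __name__ == '__main__':` check. Python sets `__name__` to `'__main__'` when a file is run directly, and to the module name when the file is imported. The sketch below shows what a module file could look like under that assumption; the function body is shortened for brevity.

```python
def this_is_the_function_name(input_argument1, input_argument2):
    """Returns the product of the two arguments."""
    return input_argument1 * input_argument2

# This block runs when the file is executed as a script,
# but is skipped when the file is imported as a module
if __name__ == '__main__':
    print('Running as a script:', this_is_the_function_name(10, 3))
```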

For more information about modules, see https://docs.python.org/3/tutorial/modules.html.

## Exercise 1.3.3

- Create a module with the two functions written so far to analyse the Gapminder dataset. Import the module, and call these functions multiple times with different arguments.
- Create a new function in this module that returns the average life expectancy on a given continent for a given year. Call this function with different arguments and compare the results.

## Next session: see you tomorrow!

Go to our next notebook: [Introduction to Day 2](20_python_data_intro.ipynb).