# Section C - Functions and Module Imports

Feedback: https://forms.gle/Le3RAsMEcYqEyswEA

**Topics**: Introducing functions and modules in Python. Basic introduction to pandas for data analysis, focusing on importing data and initial data exploration.

## Functions
A function is a named group of code. We can pass specific data into it (arguments) and get data back out of it (the return value).

We use functions for a few things:
* Reduce duplication in code - use the same function in multiple places in your code.
* Simplify code - breaking down complex code into smaller, separate problems makes the entire code more manageable and maintainable. 
* Readability - named functions say specifically what they're going to do, so our program is less cluttered and easier to follow. 
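
As a small illustration of the first point, the sketch below converts two temperature readings with and without a function. The name fahrenheit_to_celsius and the sample values are made up for this example.

```python
# Without a function: the same formula is typed out twice.
morning_f = 50
afternoon_f = 86
morning_c = (morning_f - 32) * 5 / 9
afternoon_c = (afternoon_f - 32) * 5 / 9

# With a function: the formula lives in one named place.
def fahrenheit_to_celsius(temp_f):
    '''Convert a temperature from Fahrenheit to Celsius.'''
    return (temp_f - 32) * 5 / 9

print(fahrenheit_to_celsius(morning_f))    # 10.0
print(fahrenheit_to_celsius(afternoon_f))  # 30.0
```

If the formula ever needs to change, there is now only one place to fix it.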

A note about programming in notebooks like this: breaking code up into cells helps to organize it, much like functions do in a script.  The biggest improvement I've seen in notebooks with duplicated code comes from organizing data into dictionaries and using loops to work on each group of data one at a time.  Functions are very important when writing scripts and larger programs, but a little less so in notebooks, except that when we import libraries, we call functions in the libraries.

### General Format of a Function
Here's how we define a function:

    def function_name(arg1, arg2, ...):
        '''function description in triple quoted block of text.
        This is not mandatory, but is good practice.'''
        function
        code
        here
        some_value = foo
        return some_value

We can only return one object, but because that object can be a collection like a list or dictionary, we can bundle things to pass them all out.  Examples:

    return {'a': 'dictionary', 'is': 'okay'}
    return 'this', 'will', 'return', 'a', 'tuple'
    x = ['a', 'list', 'works', 'too']
    return x

Here's an example returning a tuple:

In [None]:
def compute_stats_on_numbers(list_of_numbers):
    sum_of_numbers = sum(list_of_numbers)
    count_of_numbers = len(list_of_numbers)
    average_of_numbers = sum_of_numbers / count_of_numbers
    return sum_of_numbers, count_of_numbers, average_of_numbers  # This is a tuple.  The () around it are implied

numbers = [1, 2, 3, 4, 5]
num_sum, num_count, num_avg = compute_stats_on_numbers(numbers)

print(f'The function says - Sum: {num_sum}, Count: {num_count}, Average: {num_avg}')
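
The same statistics could also be bundled into a dictionary, so the caller looks values up by name instead of relying on their order. This is only a sketch; the function name compute_stats_as_dict and the key names are chosen for illustration.

```python
def compute_stats_as_dict(list_of_numbers):
    '''Return the sum, count, and average of a list of numbers as a dictionary.'''
    total = sum(list_of_numbers)
    count = len(list_of_numbers)
    return {'sum': total, 'count': count, 'average': total / count}

stats = compute_stats_as_dict([1, 2, 3, 4, 5])
print(stats['average'])  # 3.0
```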

### Scope
This is a new concept for us: where a variable is defined determines where it can be accessed.  Each variable has a specific scope in which it can be used. 
* **Global** - variables defined outside of functions, classes, etc., in your program are accessible from everywhere. However, it's bad practice to use global variables from inside of functions, because it makes it hard to follow what data the function uses and can introduce side effects.
* **Functions** - variables defined inside of functions are not visible outside of the function.  This means we don't need to worry about accidentally using a variable from another function when we don't mean to. 
* **Classes/Objects** - objects (instances of a class) have their own variables/properties and functions, which are accessed through the object rather than directly from outside.
* **Modules** - modules imported like, "import pandas", have their own scope inside of "pandas" that we access via the module name, like "pandas.DataFrame".  If we were to do "from pandas import *", then all things in the pandas namespace would be populated into our global namespace and we could directly access DataFrame.  This can introduce problems, e.g., if multiple modules have things with the same name inside of them. It's better to import specific things to our global namespace if wanted... "from pandas import DataFrame" will only add the DataFrame class to our global namespace.
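* And a few other places, e.g., the loop variable inside of a list comprehension, or the name bound by "except ... as" in a try/except block, which is removed when the except block ends.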

What this means to us with regard to functions is that we should pass the data the function needs in as arguments, create any variables we need inside the function without worrying about them polluting the namespace of our greater program, and then return the important data from the function with a return statement.
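
A minimal sketch of function scope, using a made-up function called area_of_rectangle: the variable created inside the function is gone once the function returns, and only the returned value reaches the caller.

```python
def area_of_rectangle(width, height):
    area = width * height   # 'area' exists only inside the function
    return area

result = area_of_rectangle(3, 4)
print(result)  # 12

try:
    print(area)             # 'area' was never defined at the global level
except NameError as error:
    print('NameError:', error)
```

Note that in a notebook this only raises NameError if no earlier cell has already defined a global variable called area.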

#### *Exercise*
Let's investigate the nuances of global and local variables in a function.  Do this:

* Run the cell below and note the values of x inside and outside of the function.
* Uncomment the x=3 line and see what changes.
* Uncomment the global declaration in the function and run it again to see what changes.

At first, x only exists in the global namespace, so when we call print, python finds it there. 

When we uncomment the x=3, we define x in the function's local namespace, so that is what gets printed.  The function's namespace will always be used before the global namespace. Note that we don't overwrite the global namespace x value when we set x in the function. 

When we uncomment the global line, we are declaring that the x in the function is in the global namespace, so when we set x=3, we are able to change the global x.  There are times when this is useful, but in general we should try not to do this because it makes code harder to debug and hides interactions between different parts of the program.  We should pass the data the function needs in as arguments. 

In [5]:
x = 2
def print_a_value(foo):
    # global x
    # x = 3
    print('x in func is:', x)
    print('foo in func is:', foo)
print('global x is', x)
print_a_value(x)
print('now global x is', x)

global x is 2
x in func is: 2
foo in func is: 2
now global x is 2
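
If a function really does need to change a value owned by the caller, one alternative to the global declaration is to pass the value in and return the new one. A minimal sketch (the function name add_one is made up for illustration):

```python
def add_one(value):
    '''Return value + 1 without touching any global variable.'''
    return value + 1

x = 2
x = add_one(x)   # the caller decides what to do with the result
print('now global x is', x)  # now global x is 3
```

Here the change to x is visible at the call site, rather than hidden inside the function.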


### Positional Arguments
When we define a function with multiple arguments like this:

    def do_the_thing(pos1, pos2, pos3, ..., posN):

We must pass the function N arguments with positions corresponding with the function definition.

    return_value = do_the_thing('stuff1', 'stuff2', 'stuff3', ..., 'stuffN')
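
Here is a small runnable sketch of the same idea with three positional arguments. The function describe_pet and its values are invented for this example. Passing too few arguments raises a TypeError.

```python
def describe_pet(name, species, age):
    return f'{name} is a {species} that is {age} years old'

print(describe_pet('Rex', 'dog', 3))   # arguments are matched by position

try:
    describe_pet('Rex', 'dog')         # one argument short
except TypeError as error:
    print('TypeError:', error)
```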

### Optional Arguments
We can also set default values for arguments, starting with argument N and working backward.  We cannot set a default value for pos1 without also setting one for pos2 and every argument after it.

    def do_the_thing(arg1, arg2, arg3=False, arg4=True):

In this case, we must pass arg1 and arg2, but we can omit arg3 and arg4 if we don't need them. 

Consider this example:

In [None]:
import random
import pandas as pd

def generate_random_data(num_rows, num_cols, to_dataframe=False):
    '''This function accepts a number of rows and number of columns and
    generates a table of random data.  If to_dataframe is True, it will
    return a pandas DataFrame.  Otherwise, it will return a list of dictionaries.'''
    data = [
        {f'col_{j+1}': random.random() for j in range(num_cols)}
        for i in range(num_rows)
    ]
    
    if to_dataframe:
        data = pd.DataFrame(data)
    
    return data

# Example usage
random_data_list = generate_random_data(5, 3)  # not necessary to specify to_dataframe=False
random_data_df = generate_random_data(5, 3, to_dataframe=True)

print(random_data_list)
print(random_data_df)

In the two lines under Example usage, the first line skips passing to_dataframe because the default value is acceptable.

#### *Exercise*
Write a function called **prompt_user** that accepts two arguments, **choices** and **num_tries**.  

It should ask the user to choose one of the choices, and give them up to num_tries attempts to type in a choice. If what they type in doesn't match any of the choices, have them try again.  If they don't succeed within num_tries attempts, return False. If they do choose one, return that choice. 

In [None]:
def prompt_user(...):
    ...
    return user_input

In [None]:
# Let's test it!
choices = ['red', 'green', 'blue']
choice = prompt_user(choices, 2)
if choice:
    print(f'You chose {choice}')
else:
    print('You did not choose a valid option')

### Keyword Arguments
Finally, some functions have lots of optional arguments.  Rather than passing them all by position, we pass just the ones we want to change by name (keyword).  Often the default is False to skip some functionality, or it is a sane default: a function like read_a_csv_file might have a default of header_row=0 to use the first row of the file as the column headers.  You'd only change it when you call the function if you have padding rows at the top of your file. 
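
To make that concrete, here is a rough sketch of what such a function could look like. read_a_csv_file is hypothetical (it is not a real library function), and for simplicity this version takes the CSV text directly instead of a file name. The delimiter argument is also an assumption added for the example.

```python
import csv
import io

def read_a_csv_file(text, header_row=0, delimiter=','):
    '''Parse CSV text into a list of dictionaries.

    header_row is the index of the row holding the column names;
    rows above it are treated as padding and skipped.'''
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header = rows[header_row]
    return [dict(zip(header, row)) for row in rows[header_row + 1:]]

padded = 'exported by some tool\nname,score\nAnn,90\nBob,85\n'

# Keyword arguments name the option being changed; the rest keep their defaults.
records = read_a_csv_file(padded, header_row=1)
print(records)
```

Because header_row is passed by name, the call stays readable and delimiter keeps its default without being mentioned.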

Let's look at some examples of the pandas read_excel function with different combinations of arguments given.  Compare to the function documentation here: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

If we have an Excel file with multiple sheets, but we want specifically to load the data from Sheet2, we can do:

    df = pd.read_excel('data.xlsx', sheet_name='Sheet2')

Or if it is only one sheet, but we want to load specific columns and skip the top two rows in the file:

    df = pd.read_excel('data.xlsx', usecols=['A', 'C', 'E'], skiprows=2)


### Arbitrary Arguments
We won't get into this, but look into *args and **kwargs.  You can make a function accept any arguments.  An example use for this is creating your own version of the print function:

    DEBUG = True

    def debug_print(*args, **kwargs):
        if DEBUG:
            print(*args, **kwargs)
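
To see what the function actually receives, the sketch below (with a made-up function named show_arguments) simply hands the two collections back: positional arguments arrive as a tuple and keyword arguments as a dictionary.

```python
def show_arguments(*args, **kwargs):
    '''Return the positional and keyword arguments exactly as received.'''
    return args, kwargs

positional, keyword = show_arguments('a', 'b', sep='-', end='!\n')
print(positional)  # ('a', 'b')
print(keyword)     # {'sep': '-', 'end': '!\n'}
```

Writing print(*args, **kwargs) unpacks those same two collections again, which is how debug_print can forward everything it was given to print.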

#### *Exercise*
Let's make a "greeting_generator" function that accepts a few arguments and returns a string with the generated greeting message. Arguments:
* name - required argument, so it should not have a default value. 
* greeting - optional argument with a default value of "Hello". 
* punctuation - optional argument with default value of "!".
* height_in_feet - optional argument with default value of False. If given, we append something witty about the user's height to the string.

In [None]:
def greeting_generator(...):
    ...
    return greeting

# Let's test it!
print(greeting_generator('Bob'))
print(greeting_generator('Alice', 'Good morning'))
print(greeting_generator('Charlie', height_in_feet=6))
print(greeting_generator('Diane', punctuation='!', height_in_feet=4))
print(greeting_generator('Eve', 'Good night', '!', 5))

## Modules
We've used a few modules so far.  Here's a summary of some common modules:

* **Data analysis and math**
  * pandas - Manipulate structured data in DataFrames.  Built on numpy.  Sort of like Excel but less tedious. 
  * matplotlib - Data visualization tool.  We use it to generate axes and subplots for more interesting plots. 
  * seaborn - Advanced data visualization and analysis tools.  
  * numpy - Work with arrays of data.  Vectorize data operations for performance. 
  * math - Trig functions, sqrt, etc. Used for individual values.  x = math.tan(y)
  * datetime - Convert string data to datetime objects and vice-versa.  Perform time operations, like adding hours, days, etc. 
* **OS and file handling**
  * sys - Access command-line arguments (sys.argv), exit the program (sys.exit), get interpreter and system information.
  * os - List files (os.listdir), read environment variables (os.environ), modify permissions, filesystem stats, user account stuff. 
  * shutil - Helper functions for moving and copying files, and a few other things.
  * tarfile, zipfile - Open or create tar and zip archive files with these. tar is more common on Linux systems. 
  * subprocess - Execute programs or commands outside of python.
* **Data encapsulation and databases...**
  * json - Structured text format of the web and many other things. Use json.dumps(data, indent=2) for pretty-printed output.
  * yaml - Like json, but friendlier for humans to edit.  More flexible, allowing inline comments in the file.
  * pickle - pickle and unpickle nearly any python object to save in a file. 
  * sqlite3 - file based database
  * mysql - open source mysql database connections...
  * pyodbc - odbc based database connections
* **Network stuff**
  * requests - talk to web servers
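
A quick taste of three of the standard-library modules listed above. The values are arbitrary and only meant to show the general shape of each module's API.

```python
import json
import math
from datetime import datetime, timedelta

# math: functions that work on individual values
hypotenuse = math.sqrt(3**2 + 4**2)
print(hypotenuse)  # 5.0

# datetime: parse a string, then do arithmetic on the result
start = datetime.strptime('2024-01-31 22:00', '%Y-%m-%d %H:%M')
later = start + timedelta(hours=3)
print(later.strftime('%Y-%m-%d %H:%M'))  # 2024-02-01 01:00

# json: convert between Python objects and text
text = json.dumps({'name': 'Ann', 'scores': [90, 85]}, indent=2)
print(text)
print(json.loads(text)['scores'])  # [90, 85]
```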

### Conventional short names
Some modules have accepted conventions for short names to reduce typing and whatnot. Here are a few common ones:
* import pandas as pd
* import numpy as np
* import subprocess as sp

### Importing modules or parts of modules

We can import modules and access their tools by module name like:

    import math
    x = math.sqrt(50)

Or we can import specific components of a module:

    from math import sqrt, cos, sin
    x = cos(30)  # the angle is in radians, not degrees

You can import all things from a module into your global namespace, but it's discouraged.  What if you import two modules that have components with the same names in them?

    from math import *

When we do "import math", all of the variables and functions in that module are protected in a private namespace that we access via math.something().  
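
The name-collision problem can be sketched with two standard-library modules that both define sqrt: math (real numbers) and cmath (complex numbers). The example imports sqrt by name rather than with *, but the effect on the global namespace is the same.

```python
import math
import cmath

# Accessed through the module name, the two sqrt functions stay separate.
print(math.sqrt(4))    # 2.0
print(cmath.sqrt(4))   # (2+0j)

# Imported by bare name, the second import silently replaces the first.
from math import sqrt
from cmath import sqrt
print(sqrt(4))         # (2+0j), because sqrt now refers to cmath.sqrt
```

No error or warning is raised when the second import overwrites the first, which is what makes this kind of bug easy to miss.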

#### *Exercise*
Let's start getting more familiar with pandas.
* Read through this: https://pandas.pydata.org/docs/user_guide/10min.html
* Import the pandas module and use dir(pd) to see what functionality is built into it.  Or if you have python running in a terminal, type 'pd.' and hit the tab key to show a list of functions built into it. 
* Create an empty pandas dataframe and do the same as above to see what functionality is built into it. 

In [None]:
import pandas as pd

df = pd.DataFrame()  # An empty dataframe

# Turtle Challenge with Functions
Note - you can find example code for running "turtle" in the A-Getting_Started notebook. 

This week, we can use functions to isolate complex operations into little chunks that other code can use to perform complex behavior with simple, readable code.
  
#### *Exercise*:
Streamline your turtle code from the Dictionaries and Loops notebook by moving the functionality to draw arbitrary shapes into a function.  The function should take arguments for number of sides and size and will be called from the rest of your code from last time.