# Python Data Structures with Pandas

* Mark Andersen, June 5 2019

This tutorial covers basic programming constructs of python.

* From other tutorials we learned that colleagues with less formal software training need some gaps filled in about functions, arguments, exceptions.
* This tutorial covers basic data structures which are used across python modules
* Finally, the use of structures like lists and dictionaries in pandas sample code is illustrated.

Prerequisite: introductory tutorial to get environment setup

# Indentation and Whitespace

## Python is sensitive to whitespace: how your code is indented changes its meaning

* Control flow statements (if, for) and functions use indentation, as introduced below.  Use the tab key to indent.  A typical editor replaces the tab with four spaces rather than leaving a tab character in the code (this is configurable)
* Do *not* indent unless called for
* A line which is not indented is assumed not to be in a block of code

In [None]:
# Run this cell

# Example 1:

# The function prints two values:
def function_good():
    print(1)
    print(2)
    
# This cell successfully defined a function and has no output

In [None]:
# Run this cell

# Example 2:

# Author intended to print two values in the function, but the python interpreter thought that print(2) is a separate statement *outside* the function
def function_bad():
    print(1)
print(2)

# This cell has a mistake: print(2) is not indented, so it is outside the function and runs immediately when the cell is run

As you go through the tutorial be attentive to whitespace.

# Variables

* Variables are containers to hold data

In [None]:
# Run this cell

# Assign to a variable by placing it on the left side of the equals sign
a = 3

# Use the variable on the right side of an assignment
b = a + 4

# Or use variable in function call
print(b)

## Strings, Ints, Floats, Boolean - types of information held by variables

* Strings are in single or double quotes:
  * a = 'happy'  <- the string 'happy' = "happy" is assigned to variable a
* Integers (ints) have no fractional component
  * b = int(2.5)  <- b will have value 2 which is an integer
* Floats (doubles) have fractional component
  * b = 2.5       <- b holds the value 2.5
* Boolean are True/False with a capital first letter
  * c = True      <-  c is marked as True and can be used in a condition
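
The built-in `type()` function reports which of these types a value has. The sketch below is only illustrative, and the variable names are made up for this example:

```python
# Illustrative values; the names are chosen only for this sketch
word = 'happy'
whole = int(2.5)     # int() drops the fractional part, giving 2
fraction = 2.5
flag = True

# type() shows the type of the value held by each variable
print(type(word))
print(type(whole))
print(type(fraction))
print(type(flag))
```

Running it should show `str`, `int`, `float` and `bool` in turn.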

In [None]:
# Run this cell

# Notice how the boolean c can be used to determine whether to print out a value
c = True
b = int(2.5)

if c:
    print(b)

The above cell demonstrated a first type of *control flow*: if statements

## F-strings

* Again we introduce f-strings as a way to have Python substitute a value into a string
* The letter f must occur immediately before the opening quote of the string
* In an f-string, the contents of curly braces are evaluated as Python expressions (doubled braces, {{ and }}, produce literal braces)

F-strings are used liberally when coding below
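
As a small sketch of what can go inside the braces (the variable names here are invented for illustration), an f-string accepts any expression, an optional format after a colon, and doubled braces when a literal brace is wanted:

```python
price = 4
quantity = 3

# Any expression inside single braces is evaluated
total_line = f'Total cost: {price * quantity}'

# A format specification after a colon controls the display (two decimals here)
average_line = f'Average: {10 / 4:.2f}'

# Doubled braces produce literal braces instead of evaluation
braces_line = f'Literal braces: {{price}} and value: {price}'

print(total_line)
print(average_line)
print(braces_line)
```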

In [None]:
# Run this cell

blah = 25
print('This is not a f-string {blah}')
print(f'This is a f-string {blah}')


# Control Flow

## Control Flow: If / Else / Elif

* if <condition>:      
  * Evaluate a condition and if true go into the next block (indented)
* elif <condition>:
  * Shorthand for "else if" in other languages.  If the prior condition was false, also evaluate this condition and if it is true do that block of code
* else:
  * If none of the conditions above were true execute the next block of code

In [None]:
# Run this cell

a = 25
b = 20
c = 15

if a < b:
    print('a < b')
elif b < c:
    print('b < c')
else:
    print('none of the conditions were true')

## Control Flow: For loops

* For loops iterate over a list or iterator.  
* We will introduce lists in more detail below.  The purpose here is to notice the for syntax
  * Notice the colon at the end of the iteration line
  * Notice the indentation within the loop

In [None]:
# Run this cell

for i in [1, 2, 3, 4]:
    print(i)

## For loops can use "continue" to finish one pass through the loop and go back to the top

In [None]:
# Run this cell

a = [1, 2, 3, 6, 5, 25, 50, 755, 3, 100, 243, 7, 5, 6, 7, 5]

for i in a:
    # Note the use of '==' to test equality, which is different from '=' for assignment.  In SQL we use '=' for both!
    if i == 3:
        continue
    print(i)
        

## Exercise: write a for loop similar to above, but "continue" on values between 4 and 8 (i.e. 5, 6, or 7)

* Comparison operators
  * \> greater than
  * \>= greater or equal
  * \< less than
  * \<= less than or equal
* Combine two operators in condition
  * (x > 3) and (x < 5): tests whether both conditions are True
  * (x > 3) or (x < 5): tests whether either condition is True
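
A quick sketch of how these operators evaluate for a single value. The value of x is arbitrary and the variable names are chosen only for this illustration:

```python
x = 4

# 'and' is True only when both comparisons are True
both = (x > 3) and (x < 5)

# 'or' is True when at least one comparison is True
either = (x > 10) or (x < 5)

# '==' tests equality and '!=' tests inequality
same = (x == 4)
different = (x != 4)

print(both, either, same, different)
```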

In [None]:
# Your code here




## For loops can use "break" to stop execution

In [None]:
# Run this cell

for i in a:
    print(i)
    if i > 4:
        break

## Exercise: write a similar loop but break the loop as soon as the value = 243 is encountered in the list

In [None]:
# Your code here




# Functions - The Building Blocks of Python Programming

* Functions are blocks of code which can be re-used
  * Arguments are provided to allow the code to have more generalizability.  These arguments to the function are variables within the function.
  * Functions often return values which then can be assigned to a variable or used in a larger code context


### For example:
* def func(a, b)  <-  a and b are arguments (variables) which may be used in the function
* If the function has a "return" statement, the function stops execution at that point and returns the value to the caller

### Syntax
* def declares the function is being defined
* The function_name comes next.  That function name is then assigned the block of code
* The arguments are listed in parentheses
* A colon is required at the end of the function declaration
* Indentation is required for the code in the function.  If a line is not indented it will be presumed to be outside the function scope.
* If a function has a return statement, it stops execution of the function, and the returned value is assigned to the variable on the left side of the call, if any

In [None]:
# Run this cell

# function to double a number
def double_number(num_to_double):
    return num_to_double * 2

# call double_number and store the result in variable x
x = double_number(3)
print(x)

Please note the *return* in the code above.  This allows the function to return a value which can then be used.  
In the double_number function above it returned a new value which is twice the original number.  This return value was then assigned to variable x and it is available as variable x from then on.
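
For contrast, a function with no return statement hands back the special value None. The sketch below uses two hypothetical functions to show the difference between printing a result and returning it:

```python
def print_double(num_to_double):
    # Prints the result but has no return statement
    print(num_to_double * 2)

def return_double(num_to_double):
    # Returns the result so the caller can keep using it
    return num_to_double * 2

y = print_double(3)     # prints 6, but nothing is returned
z = return_double(3)    # prints nothing, the result is stored in z

print(f'y is {y}, z is {z}')
```

Here y ends up as None because print_double only printed its result, while z holds 6.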

## Exercise:

* Write a function which takes two arguments and *returns* their product

In [None]:
# Your function code here





In [None]:
# Test your function here and assign result to a variable and print the variable




## Exercise:

* Write a function which takes two arguments and returns the smaller of the two values
* You can use an *if* statement in the function to compare values and determine what to return

In [None]:
# Your function code here





In [None]:
# Test your function here

### When a function hits a return statement it is done

In [None]:
# Run this cell

def func(a, b):
    return 25
    c = a + b
    print('function got here')
    return c

func(3, 4)

# Note that the function stopped as soon as the return statement was encountered

### Functions can have default arguments which are useful so not every argument has to be provided by the caller

* Best practice is not to specify default arguments in your function call unless needed.

In [None]:
# Run this cell

def multiply_three_numbers(num_a, num_b=1, num_c=1):
    return num_a * num_b * num_c


print(multiply_three_numbers(2, 2, 2))

# here we do not pass in num_c
print(multiply_three_numbers(4, 5))

# here we only pass in num_a
print(multiply_three_numbers(8))

## Exercise:

* Write a function which takes two arguments and returns their product, with default values for the arguments (your choice) in case the function call does not provide them.  
* Call the function passing 0, then 1, then 2 arguments

In [None]:
# Your code to define function here

In [None]:
# Your code to call function three different ways here

### Unnamed and named arguments in a function call

* If you do not name arguments when calling a function, Python assigns them to the function's arguments in the order they are listed
* If you do name arguments, then those names determine which are assigned
* If you mix named and unnamed arguments, all unnamed arguments must occur before named arguments

In [None]:
# Run this cell

def double_triple_etc(duple=0, triple=0, quadruple=0):
    return duple*2 + triple*3 + quadruple*4

print(double_triple_etc(triple=3))

print(double_triple_etc(quadruple=5))

print(double_triple_etc(1, quadruple=2))

In [None]:
# not allowed because unnamed argument is used after named argument
# python would not know what is intended:
print(double_triple_etc(triple=2, 3))

# Exceptions and Errors

* When a programmatic error occurs, an exception is raised
* The exception stops the function and passes control back to the calling function, and so on up the chain of callers until it is caught

In [None]:
# Run this cell

1/0

## Exceptions can be handled by catching them

* Use with care as this may result in hiding important errors.
* try block has indented code, and any error in that code jumps to the exception catching block
* The exception catching block should specify the exact type of exception to catch.  Typically you have to trigger that exception type to find out its name
* Any uncaught exception will terminate your code.

In [None]:
# Run this cell

try:
    a = 'apple'
    x = 1 / 0
    xx = 'banana'  # This line will not be reached: the error in the line above transfers control to the 'except' block
except ZeroDivisionError as err:
    print(f'Ignoring error {err}')

print('Note that apple was defined, but the code never defined banana, so now a new error was caused and not caught')  
print()
print(a)
print(xx)

### Exceptions and call stacks

When functions call other functions, which call other functions, the exception may produce a call stack.  Reading the call stack helps determine where the error occurred.

In [None]:
# Run this cell

def double(x):
    # this function has an obvious error:
    return x / 0

def add_numbers(x, y):
    return x + double(y)

def add_for_list(list_of_num):
    for i in list_of_num:
        print(add_numbers(i, i+1))
        
add_for_list([2, 3, 4])

In the above stack dump, notice how the type of exception is at the bottom, the line where the exception has occurred is reported near the bottom, and the top of the call stack is the first function call.

In diagnosing this issue, the most relevant information may be near the bottom.  But not necessarily.
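
One way to study a call stack without stopping the program is to catch the exception and capture the stack as text with the standard library's `traceback` module. This is a minimal sketch; the function names are illustrative and the division by zero is deliberate:

```python
import traceback

def divide(x):
    # Deliberate error so that an exception is raised
    return x / 0

def add_numbers(x, y):
    return x + divide(y)

try:
    add_numbers(2, 3)
except ZeroDivisionError:
    # format_exc() returns the current stack trace as a string
    stack_text = traceback.format_exc()
    print(stack_text)
```

The printed text lists the calls from the outermost to the innermost, with the exception type on the last line.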

# Data Structures

## Lists - A principal data structure for python

In [None]:
# Run this cell

a = [1, 2, 3, 4, 5]
b = [2, 3, 4, 5, 7, 9]

print(f'a is {a}')
print(f'b is {b}')

## List functions:

* list.append() adds an item to the end of the list
* list.pop() removes the last item from the list and returns it
* list.remove() removes the first occurrence of a given value from the list
* list+list concatenates two lists into a new list
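
A short sketch of these operations on a made-up list of fruit names:

```python
fruits = ['apple', 'banana']

fruits.append('cherry')     # fruits is now ['apple', 'banana', 'cherry']
last = fruits.pop()         # removes 'cherry' from the end and returns it
fruits.remove('apple')      # removes the first 'apple' found

# '+' builds a new list; nothing is de-duplicated
combined = fruits + ['kiwi', 'banana']

print(last)
print(fruits)
print(combined)
```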

## Exercise: Write a function which takes two lists, adds them together and returns the result as a single list

In [None]:
# Your function code here



In [None]:
# Your code to call the function here using the lists a and b defined above
# Notice that duplicate items are not eliminated



## Exercise: Write a function which sums all the numbers in a list (one argument) by iterating over the list and returns that sum

* Remember the syntax for iterating over the list one at a time:

```
for i in list:
    do_something_to(i)
```    
    

In [None]:
# Your function code here

In [None]:
# Call the function here

## Lists can be accessed with array accessors

* [i:j] - get elements starting at position i and ending just before position j
* [i:] - get all elements starting at i
* [:j] - get all elements up to (excluding) j
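
Positions are counted from 0, so the first element is at position 0. The sketch below applies each form to an illustrative list of letters:

```python
letters = ['a', 'b', 'c', 'd', 'e']

first = letters[0]        # a single position returns one element
middle = letters[1:3]     # positions 1 and 2; position 3 is excluded
tail = letters[3:]        # position 3 through the end
head = letters[:2]        # positions 0 and 1

print(first, middle, tail, head)
```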

In [None]:
# Run this cell

a = [1, 1, 1, 2, 3, 4, 4, 4, 4, 27, 34]
a[3:8]

## Exercise: Get the first 5 elements of list

In [None]:
# Your code here

You can reverse a list by giving the slice a third argument, the step, of -1

In [None]:
# Run this cell

a = [1, 1, 1, 2, 3, 4, 4, 4, 4, 27, 34]
a[::-1]

## Sets

* Not as often used as lists.  
* A set holds each value only once, so converting a list to a set eliminates duplicate elements
* Useful for taking unions and intersections to find unique elements

In [None]:
# Run this cell

aset = set(a)
bset = set([2,3,4,5,7,9])
print(f'Intersection: {aset.intersection(bset)}')
print(f'Union: {aset.union(bset)}')

In [None]:
# Run this cell

# Sets can also be useful for removing duplicates from a list
a = [1, 1, 1, 2, 3, 4, 4, 4, 4]
print(set(a))

## Tuples

* Tuples look like lists, but they are immutable: once created, their elements cannot be changed, so they typically have a fixed number of elements
* Generally prefer lists to tuples.  But you may encounter tuples in some cases. 
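
The sketch below, using an illustrative tuple named point, shows that a tuple is read like a list but rejects changes, and that its elements can be unpacked into separate variables:

```python
point = (2, 3, 4)

# Elements are read by position, just like a list
print(point[0])

# Unlike a list, a tuple cannot be changed after it is created
try:
    point[0] = 10
except TypeError as err:
    print(f'Tuples cannot be modified: {err}')

# A tuple can be unpacked into separate variables in one statement
x, y, z = point
print(x, y, z)
```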

In [None]:
# Run this cell

a = (2, 3, 4)
print(f'Element 0 of a is {a[0]}')

## Exercise: Write a function which takes two arguments which are lists and returns a tuple of lists

* The first element of the tuple will be the union of the two parameters
* The second element of the tuple will be the intersection of the two parameters

In [None]:
# Your function code here

In [None]:
# Invoke your function here

## Dictionary - Another principal data structure

* Dictionaries have keys and values
* They can be used to count elements matching a key
* They can be used to keep track of things and look them up later

### Keys are "entries" in the dictionary
### Values are what you find in the dictionary at the Key

For example, a key could be an identifier and a value could be related information for the identifier
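
As a minimal sketch of the common operations, the example below adds, updates, tests and loops over entries. The inventory data is hypothetical and used only for illustration:

```python
# Hypothetical inventory, used only for illustration
stock = {'pens': 10, 'pads': 4}

stock['clips'] = 25     # add a new key
stock['pens'] = 9       # overwrite the value of an existing key

# 'in' tests whether a key exists
has_pads = 'pads' in stock

# get() returns a fallback instead of raising KeyError for a missing key
tape_count = stock.get('tape', 0)

print(has_pads, tape_count)

# items() yields each key together with its value
for name, count in stock.items():
    print(f'{name}: {count}')
```

Looking up a missing key with square brackets, such as stock['tape'], would raise a KeyError, which is why get() with a fallback can be convenient.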

In [None]:
# Run this cell

# Create a dictionary:

adict = {
    'apples': 2,
    'bananas': 3
}

print(f'How many apples do you have? {adict["apples"]}')
# Note: I had to put the apples in double quotes above to not confuse with the single quote around the whole thing

## Exercise: Create a sample dictionary where each key points to another dictionary which has three elements

* Keys will be the names 'Mike', 'Mark', and 'Fred'
* Values for each will be a dictionary with three keys:
  * First key: 'hair-color', whose value will be a string
  * Second key: 'height', whose value will be a height in inches
  * Third key: 'current', whose value will be a boolean for whether the person is a current customer
  
Here is how the data should look:
* Mike - brown / 72 / True
* Mark - black / 70 / False
* Fred - red / 71 / True

Create the dictionary of dictionaries and then print it

In [None]:
# Your code here

The purpose of the above exercise is simply to show that dictionaries can contain complex values. 

## Use dictionary to count occurrences in list

In [None]:
# Run this cell

a_list = [1, 2, 3, 4, 3, 2, 1, 4, 5, 6, 7, 8]

# Create an empty dictionary
a_dict = {}

for i in a_list:
    # See if an entry exists in the dictionary already
    if i in a_dict:
        # Yes it does, add one each time
        # 
        # The syntax "+= 1" is the same as taking the value and adding one more
        a_dict[i] += 1
    else:
        # It does not yet exist, create a dictionary entry
        a_dict[i] = 1
        
print(f'The frequency is in the dictionary {a_dict}')
print(f'The frequency of fours in the data set is {a_dict[4]}')

## Exercise: Create a *function* which does what the code above does

* Takes an argument which is a list
* Creates an empty dictionary and then populates that dictionary with keys=list items, and values=count of occurrences of list item

Hint: You will basically use the code above and provide a function wrapper around it, need to indent it, and make sure the argument name in the function is the name of the list in the code.  When done return the result

In [None]:
# Your function code here



In [None]:
# Test calling your function here


## Exercise: Write a *function* which takes one argument which is a *list of lists*

* Similar to the exercise above, but this time instead of taking one list, take an argument which is a list of lists, and go through each inner list and do the same thing (counting the elements of all the inner lists into a single dictionary of counts across all lists.)

In [None]:
# Your function code here




In [None]:
# Test calling your function here


## defaultdict - a dictionary with default values for when keys are not specified

* Default dictionary is a dictionary which creates an entry as soon as you ask for it
* This is a handy class to know about if you want to write code like above with less complexity

Note: defaultdict is handy, but one can use python and never use this structure



In [None]:
# Run this cell

from collections import defaultdict
ice_cream_preferences = defaultdict(lambda: 'Vanilla')

print(f'If you do not tell what Bob is interested in, it will guess Vanilla')
print(f'What flavor of ice cream does Bob like? {ice_cream_preferences["Bob"]}')

ice_cream_preferences['Mark'] = 'Chocolate'
    
print(f'Mark likes {ice_cream_preferences["Mark"]}')
print(f'Steve likes {ice_cream_preferences["Steve"]}')
      

### Use defaultdict rather than built in dictionary to count items in a list

In [None]:
# Run this cell

# This default dict can be used so you can start with a type like int, and use the fact that
# the default int has value 0 so you do not have to see if the key exists already.

from collections import defaultdict

# empty dictionary
a_dict = defaultdict(int)

a_list = [1, 2, 3, 4, 3, 2, 1, 4, 5, 6, 7, 8]

for i in a_list:
    # One can increment the value as it will be defaulting to 0 if it does not exist already
    a_dict[i] += 1
        
print(f'The frequency is in the dictionary {a_dict}')
print(f'The frequency of fours in the data set is {a_dict[4]}')

## Exercise: Write a function which takes a list of lists and returns a defaultdict with keys=unique list values, values=count of occurrences of each list value

Exercise purpose:
* List of lists as a new type of argument
* Simple application of defaultdict

In [None]:
# Your function here



In [None]:
# Call the function here



# Pandas - DataFrame from dictionary or list

### DataFrame from dictionary

In [None]:
# Run this cell

# Remember we always import pandas "as pd" by convention
import pandas as pd

data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)

### DataFrame from list

* Include the columns parameter to give the column a name

In [None]:
# Run this cell

list_fruit = ['apple', 'banana', 'grapes', 'grapefruit']
# columns in list since there could be more than one:
pd.DataFrame(list_fruit, columns=['fruit'])

### DataFrame columns can be obtained and converted to a list for processing

In [None]:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'col_3': ['k', 'w', 'y', 'z']}
x = pd.DataFrame.from_dict(data)
x_col = x.columns
x_col_list = x_col.tolist()
print(x_col_list)


## Exercise: print the column list above in reverse order

In [None]:
# Your code here





### Pandas - Series can also be created from lists, but typically one would go straight for the DataFrame since it has expanded functionality

In [None]:
pd.Series(list_fruit)

# End of Core Tutorial

## List Comprehensions

* Python has a fancy way to walk over a list in one line called a "list comprehension".  It is an example of "pythonic" syntax, and you will run across this and may find it useful.
* For the beginner user, your goal might be to know list comprehensions exist even if you cannot reproduce the syntax
* For the advanced user, you will use list comprehensions to simplify your code and logic

In [None]:
# Run this cell

# Create a list
a = [1, 2, 7, 9]

# The list comprehension:
[i+2 for i in a]

The list comprehension above has a structure where it iterates through the list a (right part) and evaluates an expression for each element (left part)

* List comprehensions can be fancier with if clauses:

In [None]:
# Run this cell

[i+2 for i in a if i < 4]

The list comprehension above applies only to some elements in the list based on an if condition (like a where clause in SQL)
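
To make the structure concrete, this sketch writes the same filtering comprehension next to an equivalent for loop. The names short_form and long_form are chosen only for this comparison:

```python
a = [1, 2, 7, 9]

# Comprehension form: expression, then the loop, then the condition
short_form = [i + 2 for i in a if i < 4]

# Equivalent for loop that builds the same list step by step
long_form = []
for i in a:
    if i < 4:
        long_form.append(i + 2)

print(short_form)
print(long_form)
```

Both lists contain 3 and 4, since only the elements 1 and 2 pass the condition.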

## Exercise: Write your own list comprehension which squares every element in a list which has value greater than 3

## datetime library

* The datetime library is useful for working with dates and times, for example recording what time an event occurs

In [None]:
# Run this cell

import datetime

# strftime specifies how to convert the time into a string
current_time = datetime.datetime.now().strftime('%H:%M:%S')
now_as_string = datetime.datetime.now().strftime('%y/%m/%d %H:%M:%S')

print(current_time)
print(now_as_string)

## Exercise: Get the current datetime with a four-digit year in year-month-day format.  Drop the seconds

In [None]:
# your code here