# Research Computing Boot Camp
# Boston University 

Website: [rcs.bu.edu](http://www.bu.edu/tech/support/research/) <br>
Tutorial materials: [https://github.com/bu-rcs/bu-rcs.github.io/tree/main/Bootcamp](https://github.com/bu-rcs/bu-rcs.github.io/tree/main/Bootcamp)



# Python Part 1: Language Basics and Example Dataset

Note:  This tutorial is a short introduction to the Python language and the Jupyter notebook development system.  

The regular RCS Python tutorials are more in-depth and cover more features of the Python language along with debugging techniques, good software development practices, optimization, and more.  Feel free to sign up for any of those in addition to these Boot Camp tutorials.

# The Python Language



This tutorial will cover the basics of Python and introduce the data set we'll be working with.  The topics we'll cover are:

* Variables
* Functions
* Lists and Dictionaries
* Loops
* Conditionals
* Libraries
* The University of Wisconsin Population Health Institute County Health Rankings data set
* Putting it all together...

## Python Language References

The main Python site:  https://www.python.org/

Language reference: https://docs.python.org/3.8/

### What is Python?

Python is a general purpose, interpreted, object-oriented programming language. It was designed to be easy to learn and use while supporting sophisticated programming techniques and good performance.

In an *interpreted* language a program, here the Python interpreter, reads Python code from a file or from a command line and executes the code. This contrasts with *compiled* languages (e.g. C++, Fortran) where a program called the compiler converts program code from a file into executable code, which is then run directly by the computer.

### Jupyter Notebooks

The Jupyter Notebook provides an interface to the Python interpreter in a highly interactive and easy to modify environment. Program code, comments, descriptive text like this section, plots, and calculation results can all be stored in the notebook. The Python code written in a notebook can be extracted into a regular text file to bring to a different development environment if desired.  You can read more about Project Jupyter here: https://en.wikipedia.org/wiki/Project_Jupyter 

In the Jupyter system a web server runs a Python interpreter and passes code and results back and forth between your browser and Python.  This allows for the Python process to run on a different computer than your browser.  In the SCC OnDemand system Python runs on a compute server with significantly more memory, disk space, and CPU capability than a typical desktop or laptop.  A notebook can be developed on your personal computer on a small dataset and then moved to a remote system to process a potentially very large dataset.  

### JupyterLab

JupyterLab is an updated Jupyter Notebook interface. It provides some extra features such as a file browser that make it easier to work with multiple notebooks.

### IPython

Jupyter notebooks actually run a variant of the standard Python interpreter called IPython.  It adds a number of features to the Python interpreter that make it easier to use and much easier to integrate into a system like Jupyter.  Standard Python code behaves the same way under IPython; the additions (such as `%` magic commands) are conveniences layered on top of the language.

## Using the Notebook

Notebooks can have 3 types of "cells": code, markdown, and raw.  The type of a cell can be selected with the dropdown menu at the top of the notebook.

* Markdown cells are used for descriptive text, images, links, and so on.  They are formatted using Markdown syntax: https://en.wikipedia.org/wiki/Markdown   When a Markdown cell is run it will be formatted and displayed.

* Code cells are for Python code. When code cells are run they execute Python code. If the last line of code returns a value it will be printed automatically.

* Raw cells are plain text. Running raw cells has no effect other than displaying the text.

**To run or execute a cell** you can click the Play icon (triangle on its side) at the top of the notebook or press the Enter key while holding down the Shift key.

# Variables
We can assign values to labels, a.k.a. variables...

In [None]:
a = 1

In [None]:
# here is how we can print that value
print(a)

In [None]:
# this is a comment because it starts with a #. This is ignored by Python.
b = a + 1   # Use the value stored in the variable a. Also comments can come after code.

In [None]:
print(b)

In [None]:
# We have to create a variable before we can read a value from it.
b = zzz - 1

In [None]:
# Cells can hold more than one line of code of course.
a = 2     # re-define the variable a to store a new value
b = 3
c = a *  b  # Multiplication
print(c)
# Try re-writing "c = a * b" with different operators and re-running the cell
# to print the value again.
# Arithmetic operators:  + - * / ** // %  ()     Comparison operator:  ==
# example:   c = a / b   
#      or:   c = 2 * a / (1 - b)
# What do they all do?


In [None]:
# Variable names must start with a letter (lower or uppercase) or a _ (underscore).  After that digits may be used.
_this_is_ok = 0
a123 = _this_is_ok # ok
tHiS_fInE_2 = a123 # sure!

# Variable Data Types
 
Python has many built-in data types and it's possible to define your own.  The basic types are:

## Integers
Python 3 integers have arbitrary precision: they are not limited to 32 or 64 bits, and Python automatically uses as much memory as a value needs. For comparison, a 32-bit (4-byte) integer in many other languages has a range of -2,147,483,648 to +2,147,483,647.

In [None]:
a = 22
b = -2398213093
c = a * b  # integer * integer = integer
print(a,b,c) # print 'em all
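
As a minimal sketch of how Python 3 integers behave, the cell below builds a value far larger than a 64-bit integer can hold and tries the integer division and remainder operators. The variable names are chosen just for this illustration.

```python
# Python 3 integers grow as needed, so very large values are fine.
big = 2 ** 100
print(big)                 # 1267650600228229401496703205376
print(type(big))           # <class 'int'>

# Integer division (//) and remainder (%) give integers for integer inputs.
quotient = 17 // 5
remainder = 17 % 5
print(quotient, remainder)  # 3 2
```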

## Floating point

Floating point numbers are any numbers written with a decimal point.  Standard floating point numbers are 64-bit (8-byte) "double precision" values whose magnitudes range from about 2.23×10<sup>−308</sup> to 1.80×10<sup>308</sup>, positive or negative.

In [None]:
a = 1.0
b = 2.  # ok to leave off the trailing 0 but easier to mis-read
c = 1 / 2  # Division of integers gives a floating point result
d = 0.5 * 4  # integer with a float in an operation gives a float
print(a,b,c,d)

# Make a large number
e = 3.5e20   # exponents are labeled with an 'e'
d = e / 2.1023e21 
print(e,d)
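
The limits quoted above can be checked from Python itself. This is a small sketch that uses the standard `sys` and `math` libraries (importing libraries is covered later in this tutorial). It also shows that most decimal fractions are stored only approximately, so comparing floats with `==` can be surprising.

```python
import math
import sys

# Largest and smallest positive normalized double precision values.
print(sys.float_info.max)   # about 1.80e308
print(sys.float_info.min)   # about 2.23e-308

# 0.1 and 0.2 cannot be stored exactly, so their sum is not exactly 0.3.
total = 0.1 + 0.2
print(total == 0.3)               # False
print(math.isclose(total, 0.3))   # True: compare with a tolerance instead
```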

## Complex Numbers
Same as floating points with an imaginary portion. Labeled with a 'j'.

In [None]:
a = 1.0 - 2.1j
b = 2 * a  # real times a complex gives a complex result.
c = complex(3.9,0.15e3) # alternate construction
print(a,b,c)
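
A short sketch of how to pull a complex number apart. The `real` and `imag` attributes, the `conjugate()` function, and the built-in `abs()` are all part of standard Python; the variable `z` is just an example value.

```python
z = 3.0 + 4.0j
print(z.real)          # 3.0, the real part
print(z.imag)          # 4.0, the imaginary part
print(abs(z))          # 5.0, the magnitude
print(z.conjugate())   # (3-4j)
```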

## Strings
Character strings are enclosed in matched pairs of single quotes ' ' or double quotes " ".  The quote character not used to delimit the string can be used inside the string.  Triple quotes in matched pairs ''' ''' or """ """ are used for multi-line strings.

In [None]:
a = 'This is a string.'
b = "So is this."
print(a,b)
c = "Concatenate this with a and b: " + a + b  # String addition
print(c)
d = "Nice day, isn't it?"   # double-quoted string allows for a ' inside.
e = '"Yes," he replied.'  # single-quoted string allows for " inside.
print(d,e)
print('************') 
# And here's a triple-quoted string. Note that single or double quotes can be freely used inside
# the multi-line string.  Also, the spaces and newlines are retained when it is printed.
f = '''
'She can't do Addition,' the Red Queen interrupted, 'Can you do Subtraction? Take nine from eight.'

"Nine from eight I can't, you know," Alice replied very readily: 'but—'

'She can't do Subtraction,' said the White Queen. 'Can you do Division? Divide a loaf by a knife—
what's the answer to that?'

'I suppose—' Alice was beginning, but the Red Queen answered for her. 'Bread-and-butter, of course.' 
                                                                         --- Through the Looking-Glass
'''
print(f)


## Special String characters
A backslash \ is sometimes placed in front of a character in a string to indicate a special value is to be substituted.  This is called an escape sequence. Some common ones:

| Escape Sequence | Meaning |
| --- | --- |
| \n | newline |
| \r | carriage return |
| \t | tab |
| \uXXXX | Unicode character with 16-bit hex value XXXX |
| \UYYYYYYYY | Unicode character with 32-bit hex value YYYYYYYY |

In [None]:
# A string with new lines:
a = 'This is a line.\nHere is another.'  
print(a)
# and now print one with a carriage return.  Yes, this refers back to the days of electric typewriters:
print("what will\rhappen?")
# And now print some special Unicode characters.
print('\u20AC   \u00A3   \u00E5   \U0001F99E')

# A prefix of a lowercase u before the string marks it as a Unicode string.  This was required in
# Python 2.  In Python 3 every string is already Unicode, so the prefix u is optional and
# can be left out.
print(u'¡Python es divertido!')
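
An escape sequence is typed as two characters but stored as one. The sketch below checks this with `len()` and uses the built-in `repr()` to display the escapes instead of applying them. The doubled backslash `\\`, which stores a single literal backslash, is one more standard escape sequence not listed in the table above.

```python
s = 'a\tb\nc'
print(len(s))     # 5 characters: a, tab, b, newline, c
print(repr(s))    # 'a\tb\nc' shows the escapes rather than applying them

backslash = 'C:\\temp'   # \\ stores one literal backslash
print(backslash)         # C:\temp
print(len(backslash))    # 7
```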

## Boolean types  
These are true/false values. Use the values of True and False - the capital letters matter!

In [None]:
a = True
b = False
# Try out some Boolean operators:   and  or  not.  Parentheses work too.
c = a and b
d = a or b
e = not c or not (b and d)
print(a,b,c,d,e)
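
Boolean values are most often produced by comparisons rather than typed in directly. A minimal sketch, with variable names chosen only for illustration:

```python
x = 7
is_small = x < 10        # True
is_even = x % 2 == 0     # False, 7 has a remainder of 1 when divided by 2
in_range = 0 <= x <= 5   # False, comparisons can be chained
print(is_small, is_even, in_range)

# Comparison results combine with and, or, not like any other Boolean.
print(is_small and not is_even)   # True
```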


# Functions
Functions let us write a bit of code and then re-use it repeatedly.  Python also provides a large number of functions. Many data types have their own functions built into their definitions as well.

Syntax:

<code>def function_name( variables,passed,to,function):
   code for the function here
   return  # optional return command</code>

The **indentation** of the lines creates a "code block."  Python knows the function definition has ended when the indentation has ended.  The indentation for a code block must be consistent or there will be an error.

In [None]:
# Example of a simple function:

def add_two(a):
    b = a + 2
    return b    # OR: do in one line...  return a+2

# Call the function AFTER it has been defined.
b = add_two(4)
print(b)
c = add_two(b)
print(c)   # print() is a built-in function we've already been using.

# The argument is optional but the () is required. If no value is returned
# the return line can be omitted:
def print_msg():
    print("Hello world!")
    
print_msg()

In [None]:
# Slightly more complex function.
def parabola(x,a,b,c):
    tmp = a * x**2 + b * x + c
    return tmp

# Feed function outputs right into other functions
print(parabola(1,2,3,0.5))

# Here's an alternate definition with default values:
def parabola_2(x, a=2.5, b=3.1, c=0.0):
    tmp = a * x**2 + b * x + c
    return tmp

print(parabola_2(0.5))
print(parabola_2(0.5, 1)) # x=0.5, a=1.0 Defaults: b=3.1, c=0.0
print(parabola_2(0.5, c=-10.4)) # x=0.5, c=-10.4  Defaults: a=2.5, b=3.1

# As with variables we must define a function before we can use it.
z=my_new_function(1)

## Python built-in functions

Try this function which lists functions and variables:

In [None]:
dir()

In [None]:
# Get some help:
help()

## Functions built into data types.

Python is an *object-oriented* language. Briefly, this means that data and functions that work on that data can be bundled together in a structure called a *class*.

Strings are an example of a class.  Each Python string can access a set of functions that operate on that string.

In [None]:
a = "This is a string."
# What functions come with a string variable?
dir(a)

In [None]:
# Try some!
print(a.upper())
print(a.title())

In [None]:
# what do they mean?
help(a.find)
a.find('s')

### Important: Strings are immutable
Python strings cannot be changed once they are created. Functions on strings make new strings.

In [None]:
a="hello world"
b = a.upper()
print(a) # unchanged
print(b) # new string
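
A few more string functions that follow the same pattern: each one returns a new object and leaves the original string untouched. This is a small sketch using the standard `replace`, `split`, and `join` string functions.

```python
a = "hello world"
b = a.replace("world", "Python")   # returns a new string
words = a.split()                  # returns a list of words
joined = "-".join(words)           # joins the list back into one string

print(a)        # hello world (unchanged)
print(b)        # hello Python
print(words)    # ['hello', 'world']
print(joined)   # hello-world
```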

# Lists and Dictionaries

Lists and dictionaries are generic data types built into Python.

## Lists
A list is a 1-D sequence of variables and/or values.  A list can store any data type at any place in the list, including other lists, dictionaries, functions, etc. Few Python programs are written that don't use at least 1 list variable at some point.

In [None]:
# Make an empty list
a_lst = []
# OR equivalently
a_lst = list()

# Make a list with a few things in it
b_lst = [4,5.0,[],'cat']

# Refer to an element of a list
c = b_lst[0]
print(c)


In [None]:
# In Python we count from 0.
print(b_lst[0], b_lst[1], b_lst[2], b_lst[3])

In [None]:
# Count backwards from the end of the list with -1
print(b_lst[-1], b_lst[-2], b_lst[-3], b_lst[-4])

In [None]:
# list slicing:  list_name[start:end:step]
b_lst[0:2]  # Exclusive of the end index, this gives elements 0 and 1

In [None]:
b_lst[0:1] # And this is element 0 but inside a list, try b_lst[0] to compare

In [None]:
# Let's make a longer list.  range(start,end,step) is a function that makes a sequence of numbers.
# convert it to a list.
c_lst = list(range(5,25))
print(c_lst)

In [None]:
# More slicing.  Slice backwards from the last element using a step of -1.
print(c_lst[-1:-4:-1])  # elements -1, -2, -3

In [None]:
# start: means from the start index to the end of the list
print(c_lst[15:])

In [None]:
# :end means from index 0 up to, but not including, the end index
print(c_lst[:5])

In [None]:
# Print every 3rd element:
print(c_lst[0::3]) 

In [None]:
# Left-hand-side slicing overwrites several list elements
c_lst[0:4]=[-99,-100,-101,-102]
print(c_lst)

In [None]:
# Replace an element
c_lst[10]="-10,000"
print(c_lst)

# What happens if we do an assignment to an index that doesn't exist?  Uncomment and find out.
#c_lst[1000] = 30


In [None]:
# Add to the end of a list
a = []
a.append(20)
a.append(30)
a.append(40)
print(a)

In [None]:
# Join two lists
b = [4,5,6]
c = a + b
print(c)
# Add a again.  += means "add to myself"
c += a
print(c)

In [None]:
# Delete an element from a list:
b = [4,5,6]
del b[0]
print(b)
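
Lists come with their own built-in functions, just like strings. The sketch below tries a few of the standard ones (`insert`, `pop`, `remove`, `sort`) along with `len()` and the `in` operator. Unlike string functions, these change the list in place. The list contents are made up for the example.

```python
pets = ['cat', 'dog', 'goldfish']
pets.insert(1, 'parakeet')   # insert before index 1
print(pets)                  # ['cat', 'parakeet', 'dog', 'goldfish']

last = pets.pop()            # remove and return the last element
print(last)                  # goldfish

pets.remove('dog')           # remove the first element equal to 'dog'
print(pets)                  # ['cat', 'parakeet']
print(len(pets))             # 2
print('cat' in pets)         # True

nums = [3, 1, 2]
nums.sort()                  # sorts the list in place
print(nums)                  # [1, 2, 3]
```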

## Tuples

Tuples are just like lists except their contents cannot be changed after they're created.  Create them with a comma-separated sequence of values; the surrounding ( ) are optional.

In [None]:
a = (1,2)
print(a)
# Index as usual
print(a[0])
# Or create without parentheses
b = 1,2


In [None]:
# Make a tuple from a list.
a = [1,2,3,4]
b = tuple(a)
# Or a list from a tuple
c = list(b)

In [None]:
# But a tuple cannot be changed...
b[0] = 'whoops'
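
One common use of tuples is "unpacking", where each element is assigned to its own variable in a single line. This is also how a function can hand back more than one value. A minimal sketch; the function name `min_and_max` is hypothetical and only for illustration.

```python
point = (3, 4)
x, y = point          # unpack the tuple into two variables
print(x, y)           # 3 4

def min_and_max(values):
    # The comma makes this a tuple, so two values are returned together.
    return min(values), max(values)

low, high = min_and_max([5, 2, 9])
print(low, high)      # 2 9
```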

## Dictionaries

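Dictionaries store data by associating a key with a value.  Keys can be almost any data type as long as it is hashable, which in practice means immutable types such as numbers, strings, and tuples of such values (not lists or dictionaries).  Values can be anything.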

In [None]:
# Make an empty dictionary with a pair of {}
a = {}
# Initialize a dictionary with {key:value, ...etc...}
b = {1:2,'cat':'dog',-1.2:[1,2,3]}

In [None]:
# Look up values in a dictionary with a key
b[1]

In [None]:
b['cat']

In [None]:
# Add a new key to a dictionary or replace a stored value:
b['dog'] = 'bird'
b[1] = 200
print(b)

In [None]:
# What happens if we try to access a key that does not exist?
b['uh-oh']

In [None]:
# Check to see if a key exists.  We'll revisit this when we look at "if" statements.
print('uh-oh' in b)

In [None]:
print(-1.2 in b)

In [None]:
# Join two dictionaries
a = {1:2}
b = {3:4}
# We can't do a+b, that's undefined.  The dictionary a can incorporate b with the update function
a.update(b)
print(a)

In [None]:
# OR...use Python's unpacking syntax to join a and b into a new dictionary. Here ** means "use all keys and values":
c = {**a, **b}
print(c)

In [None]:
# Delete an element from a dictionary
b = {1:2,'cat':'dog',-1.2:[1,2,3]}
del b[1]
print(b)
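
Two more dictionary tools that are handy in practice, shown as a small sketch: the `get` function looks up a key and returns a default instead of raising an error when the key is missing, and `keys`, `values`, and `items` give access to the contents. The `for` loop used here is explained in the next section, and the dictionary contents are made up.

```python
ages = {'cat': 3, 'dog': 5}

print(ages.get('cat'))        # 3
print(ages.get('bird', 0))    # 0, the default for a missing key

print(list(ages.keys()))      # ['cat', 'dog']
print(list(ages.values()))    # [3, 5]

# items() gives (key, value) pairs that can be unpacked in a loop.
for name, age in ages.items():
    print(name, age)
```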

# Loops

Loops let you do things over and over. Loops let you do things over and over. Loops let you do things over and over. Loops let you do things over and over.... 

There are three main kinds of loops: for, while, and list comprehensions.  Loops can be implemented with recursive functions as well - those are functions that call themselves.

## For Loops

A Python for loop is used to loop through a collection (list, tuple, dictionary keys, etc) and lets you do something with each element of the collection.

Syntax:

<code>for x in xyz:
   ...something with x
   ...something else
</code>
Again, the indentation tells Python which lines belong to the for loop.

In [None]:
# Make a list to loop over.
a = ['cat','dog','parakeet','goldfish','earthworm']
# Print each element
for elem in a:   # elem is the variable name you choose to refer to each element in the collection
    print(elem)

In [None]:
# This time loop and print out their upper-cased versions.
for x in a:  # call your loop variable anything you like
    print(x.upper())
    


In [None]:
# For loops can be of any length, can contain function calls, and can create or use any variables in the loop.
# Changes to the looping variable don't affect the collection, mostly. 
# However - don't change the collection length (i.e. delete elements from it) while looping!
for x in a:
    x=2
    print(x)
print("a was not changed:")
print(a)
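
For loops are often paired with the built-in `range()` and `enumerate()` functions, and with `append()` to build up a new list. A minimal sketch with a made-up list:

```python
animals = ['cat', 'dog', 'parakeet']

# range(3) produces 0, 1, 2
for i in range(3):
    print(i)

# enumerate() gives the index and the element together
for idx, name in enumerate(animals):
    print(idx, name)

# Build a new list inside a loop
lengths = []
for name in animals:
    lengths.append(len(name))
print(lengths)   # [3, 3, 8]
```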

## While

While loops use a Boolean value and run as long as the Boolean value is True.  Indentation is used again, just like functions and for loops.  If the condition is never False then the loop will run forever!
    

In [None]:
x = 4
while x < 10:
    print(x)
    x += 1   # += means "add 1 to myself"

In [None]:
# These can be used to loop over a list with an index number:
idx = 0
a=[5,6,7,8,9]
# len(a) gives the length of a list...or tuple, dictionary, string, etc.
while idx < len(a):
    print(a[idx] * 2)
    idx += 1  # DON'T FORGET!

## List Comprehensions

These are incredibly useful for converting one list to another. The syntax is a combination of a list with a for loop.


In [None]:
a = [1,2,3,4,5] # Start with a list or similar variable.
b = [2 * x for x in a]  # Call each element of 'a' as 'x', multiply it by 2, and make a new list assigned to b.
print(b)

In [None]:
# Or sum a list of lists.  Use the built-in sum() function.
a = [ [1,2,3], [4,5,6], [7,8,9,-1,2.3,4.5]]
b = [sum(x) for x in a] # [sum([1,2,3]), sum([4,5,6]), ...etc]
print(b)
# now print the sum of b:
print(sum(b))

# if what we REALLY want is the sum of everything in a, pack it all together
total = sum( [sum(x) for x in a] )
print(total)
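
It can help to see a list comprehension next to the for loop it replaces. The sketch below builds the same list both ways, then adds an optional `if` filter at the end of the comprehension (`if` statements are covered in the next section).

```python
a = [1, 2, 3, 4, 5]

# The for loop version
squares_loop = []
for x in a:
    squares_loop.append(x ** 2)

# The equivalent list comprehension
squares_comp = [x ** 2 for x in a]
print(squares_loop == squares_comp)   # True

# Keep only the elements that pass a test
odd_squares = [x ** 2 for x in a if x % 2 == 1]
print(odd_squares)   # [1, 9, 25]
```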

# Conditionals

Conditional statements have the form: "if true do X else do Y". These can be combined with loops and function calls.

Syntax:

<code>if True:
    ...do this...
else:
    ...do something else..</code>

In [None]:
a = 5
b = 3
if a < b:
    print("a is less than b")
else:
    print("a is greater than b")

In [None]:
# now if a==b what happens?  Edit the above cell and re-run it. 
a = 5
b = 5
# Let's add another if statement.  Change up a and b and get all 3 statements to print.
if a < b:
    print("a is less than b")
elif a > b:
    print("a is greater than b")
else:  # 'else' is the default when nothing else was true
    print("a is equal to b")
    
# Using else and elif are OPTIONAL - it's ok to have an if statement by itself.
if 10 > 0:
    print(u"I \U0001F493 Python!")

## Conditionals With Loops

Add an if statement to a for loop...

In [None]:
a = list(range(20))
for b in a:
    if b % 2:
        print('%s is odd' % b)  # This is a "string substitution" - b is inserted in place of the %s
    else:
        print('%s is even' % b)
        

In [None]:
# Combine with the "break" statement to stop the loop.
a = list(range(20))
for b in a:
    if b == 10:
        break
    print(b)

In [None]:
# The "continue" statement forces the loop immediately to the next iteration
a = list(range(20))
for b in a:
    if b < 10:
        continue
    print(b)
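
The `%s` string substitution used above has a newer alternative, the "f-string", available in Python 3.6 and later. A string prefixed with `f` replaces anything inside `{ }` with its value. A minimal sketch of the same odd/even loop, collecting the messages in a list:

```python
messages = []
for b in range(4):
    if b % 2:
        messages.append(f'{b} is odd')    # {b} is replaced by the value of b
    else:
        messages.append(f'{b} is even')
print(messages)   # ['0 is even', '1 is odd', '2 is even', '3 is odd']
```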

# Python Libraries

Python was designed to make it easy to create and use software libraries written (usually) in Python.  Some libraries contain compiled C, C++, or Fortran code but are constructed so that they work just like Python code when you use them.  

The Python language standard defines a large collection of these libraries, referred to as the Python Standard Library:
https://docs.python.org/3.8/library/

The Anaconda Python distribution includes a few hundred more. The *python3* modules on the SCC include several hundred Python libraries as well.  You can find a massive collection of them at the PyPI site:  https://pypi.org/ 

On the SCC you can follow these instructions to install your own Python libraries to use on your projects: 
https://www.bu.edu/tech/support/research/software-and-programming/common-languages/python/install-packages/

To load a library, use the *import* command:

In [None]:
import math

In [None]:
# What's in the math library?
dir(math)

In [None]:
# what do they do?
help(math.sqrt)

"import math" loads the math library which contains a bunch of useful math functions.
Note the "math.sqrt" form.  The name "math" is called a namespace, which is used to keep library 
functions separate from your own.


In [None]:
def sqrt(x):
    print('My sqrt is wrong!')
    return x / 2 # don't ever do this in a real program

print(sqrt(10))
print(math.sqrt(10))

## Library Import Renaming

The import command can be used to re-name functions when they are imported or to discard the namespace altogether.


In [None]:
# Rename a function as it is imported, without the namespace
from math import factorial as fact
fact(4)

In [None]:
# Or just import the entire math library without any namespace at all
from math import *
sqrt(42) # which sqrt is called? The one from math or our function?
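
A name always refers to whatever was assigned to it most recently, whether by `def` or by `import`. The sketch below illustrates this with a deliberately wrong `sqrt` and also shows the common `import ... as` form, which keeps the namespace but shortens it. The alias `m` is an arbitrary choice for this example.

```python
import math as m   # keep the namespace, but call it m

def sqrt(x):
    return x / 2   # deliberately wrong, as in the earlier cell

first = sqrt(16)          # our function: 8.0

from math import sqrt     # the name sqrt now refers to math's version
second = sqrt(16)         # math's function: 4.0

print(first, second, m.sqrt(16))   # 8.0 4.0 4.0
```

Keeping the namespace (`math.sqrt` or `m.sqrt`) avoids this kind of accidental replacement.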

# The University of Wisconsin Population Health Institute County Health Rankings data set

Here's our data set:
https://www.countyhealthrankings.org/explore-health-rankings/rankings-data-documentation

This includes a broad array of health, environment and population measurements for the entire USA on a county-by-county level.  Let's load a piece of this dataset into Python and try out some of what we've learned on it.


In [None]:
# First we'll download a sub-section just for the New England states (Massachusetts, Rhode Island, Connecticut, Maine, New Hampshire, Vermont)
# Import the csv library for handling a file in CSV format, and the urllib.request library for handling a URL.
import csv
import urllib.request
import codecs  # used below to decode the downloaded bytes into text

url = 'https://raw.githubusercontent.com/bu-rcs/bu-rcs.github.io/main/Bootcamp/Data/NE_DemographicsData.csv'
# Call a function in urllib to read the remote file into a variable.
ftpstream = urllib.request.urlopen(url)
# Use a function in the codecs library to decode the csv file line-by-line into the right string format.
# urlopen returns binary bytes, not strings, so it has to be decoded before we can use it.  The csv.reader
# function will also break up each line by the commas into a list.
csvfile = csv.reader(codecs.iterdecode(ftpstream, 'utf-8'))
data = list(csvfile)  # convert to a regular list of lists.
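
To see what `csv.reader` produces without downloading anything, here is a minimal sketch that parses a few lines of CSV text held in a string. The column names and numbers are made up for illustration and do not match the real file; `io.StringIO` is a standard library tool that makes a string behave like an open file.

```python
import csv
import io

# Made-up CSV text, NOT the real data set.
sample_text = "state,county,population\nVermont,NA,30\nVermont,Addison,10\nVermont,Bennington,20\n"

rows = list(csv.reader(io.StringIO(sample_text)))
print(rows[0])            # ['state', 'county', 'population']
print(rows[1])            # ['Vermont', 'NA', '30']
print(type(rows[2][2]))   # <class 'str'>
```

Every value comes back as a string, even the numbers, which is why a conversion such as `int()` is needed before doing arithmetic on them.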

You can follow this link to see the layout of this file:
https://github.com/bu-rcs/bu-rcs.github.io/blob/main/Bootcamp/Data/NE_DemographicsData.csv


### Goal: get a total population for each state  
The row with the county name "NA" already gives the total, but as an exercise we'll compute it from the population-per-county.


In [None]:
# data is a list, each element is a line of the file:
print(data[0]) # this is the header row
print(data[12]) # here's the row at index 12 (the 12th data row after the header)

In [None]:
# Counting columns we'll need column index 1 (state), index 2 (county), and index 9 (population)
# Let's store the totals in a dictionary where the keys are the state name.
# How do we get the states?
# Method 1:  pops = {'Massachusetts':0, 'Connecticut':0}
# but that gets boring quickly, and what if the data changes?  Or we want to use a different set of states?

In [None]:
# Method 2: let Python figure it out for us by taking advantage of the dictionary.
pops = {}  # Initialize a dictionary.
# Now let's loop through the data, and for each row try to put the state into the dictionary.
# If it's not there, add it with a population initialized to 0. If it is there, do nothing.
# remember data[0] is the header row so start at the next one and go to the end.
for row in data[1:]:
    # each row is a list already...
    ##### what do we do here?
    pass  # placeholder so the cell runs; replace it with your code
    

In [None]:
# Method 2 solution...click on the blue line to the left of the dots to expand the solution.


In [None]:
pops = {}  # Initialize a dictionary.
for row in data[1:]:
    # each row is a list already...
    state = row[1]
    if state not in pops:
        pops[state] = 0 # add it and initialize the value to 0.
    # else...do nothing.


In [None]:
# Now let's loop through the data again and add the county populations...if the county name is not NA
for row in data[1:]:
    #....?
    pass  # placeholder so the cell runs; replace it with your code

In [None]:
# expand the dots to see the solution

In [None]:
for row in data[1:]:
    state = row[1]
    county = row[2]
    population = row[9]
    if county != 'NA':
        pops[state] += int(population)
        


In [None]:
print(pops)


In [None]:
# just for fun - use the pprint library to "Pretty Print" the pops dictionary
import pprint
pp = pprint.PrettyPrinter()
pp.pprint(pops)

In [None]:
# Bonus solution: combine both for loops into 1.

In [None]:
pops = {}
for row in data[1:]:
    state = row[1]
    if state not in pops:
        pops[state] = 0
    if row[2] != 'NA':
        pops[state] += int(row[9])
pp.pprint(pops)
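
One possible variation on the combined loop, offered as a sketch rather than the tutorial's solution: the dictionary `get` function with a default of 0 can replace the separate "is the state already there?" check. The rows below are made up and use a simplified three-column layout, so the indices differ from the real file.

```python
# Made-up rows in a simplified layout: state, county, population.
sample = [
    ['state', 'county', 'population'],
    ['Vermont', 'NA', '30'],
    ['Vermont', 'Addison', '10'],
    ['Vermont', 'Bennington', '20'],
    ['Maine', 'NA', '7'],
    ['Maine', 'Knox', '7'],
]

totals = {}
for row in sample[1:]:                    # skip the header row
    state, county, population = row       # unpack the three columns
    if county != 'NA':                    # skip the per-state summary rows
        # get() returns 0 the first time a state is seen
        totals[state] = totals.get(state, 0) + int(population)

print(totals)   # {'Vermont': 30, 'Maine': 7}
```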