# Programming Basics for Analytical Chemistry

## Learning objectives for today:
    
    
1. Learn some basic programming terminology:

    * variables (and three variable types - float, integer and string)
    
    * simple functions
    
    * comments
    
    
2. Learn how to use simple programming tools to automate simple calculations

    * Indexes and lists
    
    * For loops
    
    * If-then statements
    
    
3. Practice simple statistics for describing data:

   * Calculate and properly report average, standard deviation and confidence interval for a data set
    
   * Use a Grubbs test to determine if a data point is an outlier
    


## Part 1 - Introduction to coding, and some simple math
You are working in a Jupyter notebook, which lets us include regular text, written in "Markdown", alongside our Python code. What you are reading right now is Markdown. What you see in the box below is Python code. In a Markdown box we can write whatever detailed descriptions we want. In a code block we have to be more careful: the computer will try to interpret anything we write there as code, unless we use a # to mark it as a "comment".
So we'll do some math, and use comments to explain what we're doing in the code block below:


In [1]:
# addition
1+2

# subtraction
6-5

# multiplication
15*69

# division
50/5

# exponents
3**3

# operators work for things other than numbers!

'hello ' + 'world'

'hello world'
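
Only the last result, 'hello world', showed up under the cell above: a notebook displays just the value of the final line. As a minimal sketch of one way to see every result, each calculation below is wrapped in the print function (covered properly in Part 2).

```python
# Wrapping each calculation in print() shows every result, not just the last one
print(1 + 2)               # addition
print(6 - 5)               # subtraction
print(15 * 69)             # multiplication
print(50 / 5)              # division (the answer comes back as a decimal, 10.0)
print(3 ** 3)              # exponent
print('hello ' + 'world')  # + joins two pieces of text together
```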

## Part 2 - Variables and Printing

OK, so far Python seems like a pretty lousy calculator. To do more, we need to store information in variables. There are three important variable types that we will use in this class.

* Integers: Whole number values, abbreviated 'int'
* Floats: Numbers with decimal places, abbreviated 'float'
* Strings: Text (letters, words, or sentences) written inside quotes, abbreviated 'str'

We can use the command "type" to ask Python to let us know what kind of variable something is. So, which variables in our calculator do you think match these three types? Let's double check!

A notebook only displays the value of the last line in a code block, and a line that saves something in a variable (like `a = 5`) has no value to display. When we want to see information stored in a variable, we have to tell Python what we want to see. The simplest way to do this is using the print function, which will work with any of our variable types!

Let's define an example of each variable type, and use the print and type commands!

In [2]:
# integer
a = 5

# float

b = 5.0

# string

c = 'five'

# print(variable) to output the variable

print(a)

# type(variable) will tell us what kind of variable it is
type(a)

5


int
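
The cell above only checks `a`. As a short sketch, the block below checks all three variables and then shows what happens when an integer and a float are added together. The variable `d` is made up for this illustration.

```python
a = 5        # integer
b = 5.0      # float
c = 'five'   # string

# print() can show several things at once, separated by commas
print(a, type(a))
print(b, type(b))
print(c, type(c))

# Adding an integer to a float gives a float
d = a + b
print(d, type(d))
```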

## Part 3 - Lists and Indexing

We are usually going to want to work with a lot of numbers, not just one or two. We can store data in lists, and then use those lists to automate repetitive mathematical functions. We'll take some data here that hopefully looks familiar and put it in a list called "data".

If we print the whole list, we'll get all of the data stored there. 

In [3]:
data = [14.1, 11.0, 13.8, 13.6, 14.8] #ppb

print(data)
type(data)

[14.1, 11.0, 13.8, 13.6, 14.8]


list

If we want to see just one number from a list, we can do that too! However, it is very important to remember <b> Computers start counting at zero! </b>
So for the first item in a list, we need to ask the computer for item 0.

In [4]:
print(data[0])

14.1


What should we do to print the third number in this list? Try it here:

In [5]:
print(data[2])

13.8
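
Because counting starts at zero, the last item in a list of five numbers is item 4, not item 5. The sketch below shows two ways to reach the last item: using the built-in len function, or using a negative index, which counts backwards from the end. The variable names are only for illustration.

```python
data = [14.1, 11.0, 13.8, 13.6, 14.8]  # ppb, same values as above

n_points = len(data)         # number of items in the list
first = data[0]              # index 0 is the first item
last = data[n_points - 1]    # the last index is always len(data) - 1
also_last = data[-1]         # negative indexes count back from the end

print(n_points, first, last, also_last)
```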


## Part 4 - For Loops
This isn't very helpful if we have hundreds or thousands of data points; would you want to type a line of code for every single data point when you needed it?

Luckily, we can always automate repetitive functions!

One way to do this is called a "for loop":
We will ask the computer to continue doing a certain process for each value in a list. The for loop below is set up to print every word stored in the list fruits.

In [6]:
# List of fruits

fruits = ["apple", "banana", "cherry","apricot","blackberry"]
for x in fruits:
  print(x)

apple
banana
cherry
apricot
blackberry


<b>Answer the following questions before you continue: </b> (double click in this box to type in it!)
1. Add another fruit to the list, and rerun the block. Did you have to change anything in the for loop to see all of the names?
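
The learning objectives also mention if-then statements. As a rough sketch of how one fits inside a for loop, the block below keeps only the fruits whose name starts with the letter "a". The list name `a_fruits` is made up for this illustration.

```python
fruits = ["apple", "banana", "cherry", "apricot", "blackberry"]

a_fruits = []  # hypothetical empty list to collect the matches
for x in fruits:
    # x[0] is the first letter of the word (remember: counting starts at zero)
    if x[0] == "a":
        a_fruits.append(x)  # this line only runs when the condition is true

print(a_fruits)
```

Changing the letter in the if statement changes which fruits are kept, without touching the rest of the loop.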


## Part 5 - Apply lists and for loops to simple mathematical functions

A data set that has more than one number will often be stored in a list, and we use for loops to manipulate those lists, and carry out complicated mathematical functions. Today we'll write the code for two simple examples: averages and standard deviations!

### Using Python to calculate an average:

<b> Answer the following question before you continue </b>
 1. Take out a piece of paper or your calculator, and figure out the average for all of the numbers in data. What did you get?
 
 
 2. You need to explain to a friend how to calculate an average. Could you break the process down into just a few basic steps? Write your steps here
 

In [7]:
## Average

sumdata = 0
for x in data:
    sumdata = sumdata + x
        
average = sumdata/len(data) 

print (average)

13.460000000000003
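
The stray digits at the end of 13.460000000000003 come from the way computers store decimal numbers, not from the data. As a cross-check on the loop above, this sketch repeats the calculation with Python's built-in sum function and then uses round to trim the result. The name `average_builtin` is only for illustration.

```python
data = [14.1, 11.0, 13.8, 13.6, 14.8]  # ppb

# The loop from above, written out step by step
sumdata = 0
for x in data:
    sumdata = sumdata + x
average = sumdata / len(data)

# The same calculation with Python's built-in sum function
average_builtin = sum(data) / len(data)

print(average, average_builtin)
# round(number, digits) trims the stray digits left by floating-point storage
print(round(average, 2))
```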


### Using Python to calculate a standard deviation

The equation for standard deviation is  $ s={\sqrt {\frac {\sum _{i=1}^{n}(x_{i}-{\overline {x}})^{2}}{n-1}}} $ where $ n $ is the number of data points you have, $ {\overline {x}} $ is the average and $ x_{i} $ is any one of your data points.

If you don't remember the $ \sum $ symbol, it's called a summation: you do the operation inside the summation (in this case $ (x_{i}-{\overline {x}})^{2} $) for each data point, and then add all of the results together.


For your practice (and to make sure you really understand the math!), take a minute here and calculate the standard deviation by hand. <b> Answer the following questions before you proceed </b>
1. What value did you get for the standard deviation when you calculated it by hand?

2. Break the process you did down into steps. Which step do you think a for loop might help you do faster? 

Now let's write some code!

In [8]:
# standard deviation

# note that we must sometimes 'import' additional commands to supplement Python's basic library.
# The "math" library gives us access to extended mathematical functions, like square roots and logarithms.
import math
# add your comments here; what are we doing?

sigma_x = 0
n = len(data)

# add your own comments here
for x in data:
    sigma_x = sigma_x + ((x-average)**2)
    
    

std = math.sqrt(sigma_x/(n-1))

print(std)

1.4484474446799926
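
To compare the code with a hand calculation, it can help to see each intermediate step. The sketch below is one possible way to do that: it prints every data point with its deviation from the average and its squared deviation, then finishes the calculation. The names `deviation` and `sum_sq` are only for illustration.

```python
import math

data = [14.1, 11.0, 13.8, 13.6, 14.8]  # ppb
average = sum(data) / len(data)

sum_sq = 0  # running total of the squared deviations
for x in data:
    deviation = x - average
    sum_sq = sum_sq + deviation**2
    # show each step so it can be checked against a hand calculation
    print(x, round(deviation, 2), round(deviation**2, 4))

# divide by n-1, then take the square root
std = math.sqrt(sum_sq / (len(data) - 1))
print(round(sum_sq, 3), std)
```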


## Correctly Formatting Outputs


<b> Whoa, that's a lot of digits! </b> That can't possibly be the right number of significant digits, right? Remember that the computer only knows how to do what you tell it to do! So if you don't tell it to round to a certain number of digits, it will just give you everything it has stored. So let's figure out how to round those numbers to something a little more reasonable, and while we're at it, we'll do a better job of presenting this data with the correct formatting of average +/- standard deviation.

### New Rule for Sig Figs

Forget the rules you memorized in gen chem for significant figures. When working with real data, we should always have a reasonable approximation of error, which will define our significant digits for us. In this case, we'll use standard deviation as our estimate of error. So you have two steps now to determine correct significant figures:
1. Round the error value to one significant digit (or two IF the error value begins with a 1)
2. Round the average to the same decimal place as the error value
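
The two steps above can be sketched as a small function. This is only one possible approach, not the only way to do it: the function name `round_to_error` is made up for this illustration, and it relies on Python's scientific-notation formatting to find the first significant digit of the error.

```python
def round_to_error(average, error):
    """Round (average, error) with the two-step rule above. Illustrative helper."""
    # Scientific notation splits the error into digits and a power of ten,
    # e.g. 1.448 -> '1.448000e+00', 0.0312 -> '3.120000e-02'
    digits, power = format(abs(error), 'e').split('e')
    # Step 1: two significant digits if the error starts with a 1, otherwise one
    sig_digits = 2 if digits[0] == '1' else 1
    # Number of decimal places that keeps exactly that many significant digits
    decimals = sig_digits - 1 - int(power)
    # Step 2: round the average to the same decimal place as the error
    return round(average, decimals), round(error, decimals)

print(round_to_error(13.460000000000003, 1.4484474446799926))
print(round_to_error(0.2514, 0.0312))  # made-up values to show the one-digit case
```

The first call uses the average and standard deviation calculated earlier; the error starts with a 1, so it keeps two significant digits.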

### Formatting Results
Never report averages without error, and never report averages or errors without units! The correct format, if we're working with absolute error, is always "average +/- error units".



In [9]:
# Using an f-string (a string that starts with F) to get the correct sig figs and formatting for this answer

print(F'The average lead measurement in our data set is {average:.1f} +/- {std:.1f} ppb')


The average lead measurement in our data set is 13.5 +/- 1.4 ppb


## Part 6 - Confidence Intervals

So far, everything we've done should have been a review from Gen Chem! But we can do better than standard deviation as an estimate of error for experimental data. To do this, we use confidence intervals.

The equation for confidence interval is  $$ CI= {\frac {ts}{\sqrt {n}}} $$ We already know how to get n (from the length of the list!) and you already calculated s! So now we just need t. Luckily, Python has those t-tables from your textbook; we just have to tell it which one we need! See the code below:



In [10]:
import scipy.stats as stats
# stats.t.ppf takes two inputs: a probability, then the degrees of freedom (n-1)

# we will always use "two-tailed" t values, so the probability we pass in is 1 - alpha/2 rather than 1 - alpha

# confidence level

alpha = 0.05 # 1 - alpha should equal your confidence level. Here, we use 0.05, for a 95% confidence interval
dof = len(data) - 1 # degrees of freedom is the number of samples (i.e. the length of the list) minus one

# two-tailed t statistics require the following format:

t = stats.t.ppf(1-alpha/2, dof) # inputs are the probability set by alpha, and the degrees of freedom

#check that this matches the value in the textbook!
#print(t)


# calculate the actual confidence interval
CI = std*t/math.sqrt(len(data))


print (F"the average concentration is {average:.1f} +/- {CI:.1f} ppb")

the average concentration is 13.5 +/- 1.8 ppb
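
A confidence interval can also be written as a range, from average - CI to average + CI. The sketch below works that range out for our five data points. It assumes t = 2.776, the standard two-tailed t-table value for 95% confidence and 4 degrees of freedom, typed in by hand instead of looked up with scipy. The names `lower` and `upper` are only for illustration.

```python
import math

data = [14.1, 11.0, 13.8, 13.6, 14.8]  # ppb
n = len(data)
average = sum(data) / n

# standard deviation, using the same loop as before
sum_sq = 0
for x in data:
    sum_sq = sum_sq + (x - average)**2
std = math.sqrt(sum_sq / (n - 1))

t = 2.776  # assumed t-table value: two-tailed, 95% confidence, 4 degrees of freedom
CI = t * std / math.sqrt(n)

lower = average - CI
upper = average + CI
print(F'95% confidence range: {lower:.1f} to {upper:.1f} ppb')
```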


## Bigger Data Sets

Obviously with 5 data points, writing all this code isn't really all that much faster than just doing the math by hand, on your calculator or in Excel. Where this becomes a real advantage is when the data sets are very large.
Here is an example data set from the California Environmental Data Exchange Network (an organization whose mission is to generate high quality, accessible and usable data to help protect and restore California's watersheds). This is NOT drinking water data, it is surface water data (lakes, streams, ponds, etc.) but it is a great example of a big database we might want to work with!
<b> Before you begin running any of this code, we need to load the data file! </b>

In [11]:
# Don't worry about anything in this box, this is just a strategy for importing data into a list so we can use it!
# Just run this block, to see all of the Pb measurements this database has!
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("cd3-Pb.csv")

Pb_raw = data['result'].tolist()
units = data['unitname'].tolist()
# keep only the real measurements (drop any missing values)
Pb = [x for x in Pb_raw if not pd.isnull(x)]
#print(Pb)
#print(data)

# histogram of all the Pb results, sorted into 100 bins
n, bins, patches = plt.hist(Pb, 100)

plt.xlabel('Pb in ug/L (ppb)')
plt.ylabel('Number of samples')

plt.title('Histogram of Pb in CEDEN water data',
          fontweight="bold")
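
For the curious: the line that builds `Pb` is a compact for loop with an if statement inside it, used to skip missing values. The sketch below shows the same idea written out the long way. It uses made-up numbers, not the CEDEN file, and it swaps in `math.isnan` for the pandas check so that it runs without pandas.

```python
import math

# Made-up example values; float('nan') stands in for a missing measurement
raw = [0.5, float('nan'), 1.2, float('nan'), 0.8]

cleaned = []
for x in raw:
    if not math.isnan(x):   # keep only the real numbers
        cleaned.append(x)

# The one-line version does exactly the same thing
cleaned_short = [x for x in raw if not math.isnan(x)]

print(cleaned)
print(cleaned == cleaned_short)
```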

### Whoa that's a lot of data!
OK, we don't want to try to work with this data manually. Even working in Excel would be a little tedious with this many data points! So let's process this data in Python.
Let's get a sense for this data, and then calculate our descriptive statistics!

## 1. Range and n

In [None]:
top = max(Pb)
bottom = min(Pb)
n = len(Pb)

print(F'This data set has {n} points, with a maximum of {top} {units[0]} and a minimum of {bottom} {units[0]}')


## 2. Average and Standard Deviation

The functions we wrote above are actually already built into Python! We'll use the built-in functions here, for simplicity's sake!

In [None]:
import statistics as stat

x_Pb = stat.mean(Pb)
s_Pb = stat.stdev(Pb)

print(F'The average lead measurement is {x_Pb} +/- {s_Pb} ug/L')

## 3. Confidence Interval

In [None]:
#confidence level

alpha = 0.05 # 1 - alpha should equal your confidence level. Here, we use 0.05, for a 95% confidence interval
dof = len(Pb) - 1 # degrees of freedom is the number of samples (i.e. the length of the list) minus one

#two tailed t statistics require the following format:

t = stats.t.ppf(1-alpha/2, dof) #inputs are alpha to set the confidence interval and degrees of freedom

#check that this matches the value in the textbook!
#print(t)


# calculate the actual confidence interval
CI_Pb = s_Pb*t/math.sqrt(len(Pb))

print(F'The average lead measurement is {x_Pb} +/- {CI_Pb} ug/L ({(1-alpha)*100} % confidence level)')