# Lab One - Climatic Averages 

## *Analyzing the Global Temperatures Divergence from Average from 1880 - 2018*

In this lab we learn the basics of Python (part 1), the programming language we will use for data analysis, by working in the Jupyter environment (this display). 

You will learn how to: 


- Use Jupyter
- Read a CSV (comma-separated values) file into a data format for analysis
- Implement simple flow structures (for loops, if statements)
- Make basic plots (line plots, bar charts, and colors)
- Use indexing (simple and boolean)
- Use the following data structures: numpy arrays and Pandas dataframes
- Use hex codes for colors

By the end of this lab you should be able to read in simple data from a CSV file, use boolean indexing, and make a line plot or bar chart.

Additional materials for reading and reference: Igual & Seguí, Chapter 1 and Chapter 2 (Sections 2.1 through 2.6.2). 

More on Jupyter here:
    http://jupyter-notebook.readthedocs.io/en/stable/notebook.html

## *Part 1 - What is Jupyter?*



Jupyter is an interactive environment (for example, this Notebook is in Jupyter) where we can explore how a programming language, e.g. Python, works. I like to think of it as a display format that mixes text, like this box, with code, like the next box. We will be doing the first lab reports using Jupyter. Note this is not the only way to run Python. 

### Running Cells: 
- You can "run" the selected cell by pressing Shift+Enter OR by clicking Run in the menu. Note you can also run blocks of cells (say, all cells before or after a certain cell) through the Cell tab.

### Types of Cells: 
- You can have cells as programming commands or as text. You can switch between programming language and text through the Cell -> Markdown option. 

### Editing Cells: 
- To edit a cell - double click in the cell to interact with the program. 


### Interactive Exercise 1!

Let's try this out! We have two blank cells below. In the first cell type the following: 
print('Python is awesome')

In the next cell type the same words, but this time switch the cell to "Markdown" through the cell menu. 

Then run both cells by pressing Shift+Enter OR hitting the Run button above. 

When you are done please put your nametag down. We will discuss shortly!

In [None]:
# This is cell one - type your commands below.

# In Python a '#' is a comment, anything after this 
# will not be evaluated as a command.






In [None]:
# This is cell two - type your commands below.







'

''

'''

''''

'''''

''''''

'''''

''''

'''

''

'







### Why are we using Jupyter? 

There are many ways to use Python. We are starting with Jupyter because (1) it allows us to learn together, (2) it enables a lot of self-learning on your own after class, and (3) it makes your lab reports much more interesting to complete. You get to turn in great Notebooks that are pre-evaluated. This means if you wrote half your code down but half didn't work, we can still give you partial credit. Wahoo! Becoming a coding expert takes time and effort, and we want to reward this effort.  

## *Part 2 - What is Python and why are we using it?*

Python is an object-oriented, interpreted programming language. Object-oriented means it has 'objects' with attributes (data) and methods (functions) that you can access in your programs. Interpreted means there is no separate compilation step: the interpreter executes code statement by statement from beginning to end. This Notebook is running a version of Python called Interactive Python, or IPython. We will see the benefits of using IPython in action in this first lab. We have already seen the print command above and compared Markdown cells to code cells. 

### *Part 2. A - Let's start by importing some packages*

In [None]:
# Think of packages like enabling different levels of a game. 
# The hash symbol (#) starts a comment; the computer skips everything after it 
# on that line.

import numpy as np
import pandas as pd

# These two packages enable data analysis through various objects (and their methods) and 
# data types. 

# For example, you can create a numpy array - 1D data array - of numbers as follows:

# This creates an array from 0 up to, but not including, 10, taking every second value.
example_array = np.arange(0, 10, 2)

# You can see this as follows with the following print command we saw earlier.
print(example_array)
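
To get a feel for the start, stop, and step arguments, here is a small sketch with a few more `np.arange` calls. The variable names `evens` and `first_five` are made up for this illustration and are not used elsewhere in the lab.

```python
import numpy as np

# Start at 0, stop before 10, step by 2 (the same call as in the cell above).
evens = np.arange(0, 10, 2)
print(evens)                    # [0 2 4 6 8]

# With a single argument, np.arange counts from 0 up to (not including) it.
first_five = np.arange(5)
print(first_five)               # [0 1 2 3 4]

# The number of elements is available through len() or the .size attribute.
print(len(evens), evens.size)   # 5 5
```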

In [None]:
# Each object in Python has a type, in this case we see that it is a numpy array.
type(example_array)

In [None]:
# We can access different values within an array with indexing, for example:

print(example_array[0]) #print index 0 value
print(example_array[1]) #print index 1 value

In [None]:
# Types also include integers (non-decimal numbers), floats (decimal numbers), strings (words), and many others...

# For example
first_entry = example_array[0]
print('{} has type {}'.format(first_entry, type(first_entry)))

example_int = 5
print('{} has type {}'.format(example_int, type(example_int)))

# We can also (sometimes) change the type of certain objects.
example_float = float(example_int)
print('{} has type {}'.format(example_float, type(example_float)))


# If you get a TypeError when you run code, use these type commands to find the 
# issue; most likely you are trying to do something that cannot be done to a string, 
# an integer, etc. 

# Note the type of first_entry is a numpy integer (int64 on most systems),
# as compared to the type of example_int, which is JUST int.
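
As an illustration of the kind of type error mentioned above, the sketch below tries to add a string to an integer, catches the resulting `TypeError`, and then fixes it with a type conversion. The variable `example_string` is made up for this example.

```python
example_string = '5'   # a string that happens to look like a number
example_int = 5

# Adding a string and an integer is not defined, so Python raises a TypeError.
try:
    total = example_string + example_int
except TypeError as err:
    print('TypeError:', err)

# Converting the string to an int first makes the addition valid.
total = int(example_string) + example_int
print(total, type(total))
```

Checking `type(...)` on each piece, as in the cell above, is usually the quickest way to find which object has the unexpected type.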

### Interactive Exercise 2!

What do you think printing the index of -1 would output? We saw indexing of 0 and 1 before;
in the following cell, print the value in `example_array` located at index -1. When done, put 
your nametag down. While you wait for others to finish, what do you think this 
implies about indexing in Python? How could this be advantageous? When might it become a problem?


In [None]:
###write your command below




'

''

'''

''''

'''''

''''''

'''''

''''

'''

''

'

### *Part 2. B - Now what about that other package we imported -  Pandas*

Pandas is a package which we use to enable data set analysis. There are other data structures within Python that we 
can use as well, for example numpy arrays, dictionaries, and lists. We will focus right now on pandas dataframes.

### What is a Pandas DataFrame?

A pandas dataframe is a 2D data structure which includes an index and rows and columns of data. These can include ints, floats, strings, etc. 

A good way to think of a pandas dataframe is an excel spreadsheet which we analyze and interpret with Python. Enough about all this, let's get our hands on some data. 


In [None]:
# Within the folder you all downloaded is a Data subdirectory with global temperature 
# anomaly data in the following comma-separated values (CSV) file.

# This data is from the following website: 
# https://www.ncdc.noaa.gov/cag/time-series/global/globe/land/ytd/12/1880-2018

# Provided by NOAA National Center for Environmental Information

# Global temperature anomaly data come from the Global Historical Climatology 
# Network-Monthly (GHCN-M) data set and International Comprehensive Ocean-Atmosphere 
# Data Set (ICOADS), which have data from 1880 to the present. 

# These data are the timeseries of global-scale temperature anomalies calculated with respect to 
# the 20th century average.

# The following command reads from a csv format into a pandas dataframe and assigns it to our variable
# that we named temp_var_global

# Note the header=4 argument: it tells pandas that row 4 of the file (counting from 0) holds
# the column names, so the four lines before it are skipped.

temp_var_global = pd.read_csv('./Data/global_land_ocean_1880_2018_temp_variants.csv', header=4)
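
To see what `header=4` does without opening the real file, here is a minimal sketch that reads a tiny made-up CSV from memory. The text in `fake_csv` and its numbers are invented for illustration; they only mimic a layout with four description lines above the column names.

```python
import io
import pandas as pd

# A made-up file: four lines of description, then the column names, then data.
fake_csv = io.StringIO(
    "Title line\n"
    "Units line\n"
    "Base period line\n"
    "Missing value line\n"
    "Year,Value\n"
    "1880,-0.12\n"
    "1881,-0.09\n"
)

# header=4 means line 4 (counting from 0) holds the column names;
# the four lines before it are skipped.
demo = pd.read_csv(fake_csv, header=4)
print(demo.columns.tolist())   # ['Year', 'Value']
print(demo.shape)              # (2, 2)
```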

In [None]:
# Let's see if this was read in correctly - you should ALWAYS do this to make sure the data were read in properly. 

# Check the type - it should be a dataframe.
print(type(temp_var_global))

# Check the first 10 rows
temp_var_global.head(n=10)

# Here you can see on the left-hand side the index values, followed by the Year, then Value. 
# The CSV file did NOT provide a descriptive name for the data column, so we have the 
# 'Value'.

# Given the information above and at the website, we know that this is the temperature anomaly 
# for Earth for each year in Celsius.

In [None]:
# You can also check the column values as follows:

print(temp_var_global.columns)

In [None]:
# Let's rename the column so it's not a vague "Value".

temp_var_global.rename(columns={'Value': 'AnomalyC'}, inplace=True)
# inplace=True modifies temp_var_global directly instead of returning a NEW dataframe object.
# What would happen if we did not use inplace=True ?

# Let's check the columns
print(temp_var_global.columns)

In [None]:
# How about for fun, let's check the entry at index 50? (Counting starts at 0, so this is the 51st row.)
print(temp_var_global.loc[50, 'AnomalyC'])

In [None]:
# What about the entire row?
print(temp_var_global.loc[50, :])



### Put your nametag down when you have reached this point. Can you guess what : does within this example?
'

''

'''

''''

'''

''

'

### *Part 2. C - Adding Values into a Dataframe*

### Our temperatures are in Celsius, maybe it would make sense to create a column in Fahrenheit?

Formula is: $\Delta\mathrm{T}(F) = \Delta\mathrm{T}(C) \times 1.8$; for a temperature difference, we skip adding $32$.

**Note: Explore the Markdown version of this cell by double-clicking this line.**


*This is how we write equations using LaTeX in a Jupyter Notebook.* See https://en.wikibooks.org/wiki/LaTeX/Mathematics for more examples of how to use LaTeX.

In [None]:
# Here we define a NEW column based off the old column.
temp_var_global['AnomalyF'] = temp_var_global['AnomalyC'] * 1.8

# Let's make sure this did what we wanted.
print(temp_var_global.head(n=10))
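
Why can we skip the $32$ for an anomaly? The sketch below converts two absolute temperatures to Fahrenheit and subtracts them; the offset cancels, leaving only the factor of 1.8. The function `c_to_f` and the sample temperatures are made up for this illustration.

```python
def c_to_f(temp_c):
    """Convert an absolute temperature from Celsius to Fahrenheit."""
    return temp_c * 1.8 + 32

# Two made-up absolute temperatures in Celsius.
warm_c, cool_c = 20.0, 10.0

# The difference in Celsius, and the same difference after converting each
# temperature to Fahrenheit. The +32 appears in both terms and cancels.
diff_c = warm_c - cool_c
diff_f = c_to_f(warm_c) - c_to_f(cool_c)

print(diff_c, diff_f, diff_c * 1.8)
```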

In [None]:
# Notice that the dataframe is ordered by year. 
# What if we instead wanted to order it by temperature anomaly?

# Pandas Dataframes have a method for sorting.
temp_var_global_sorted = temp_var_global.sort_values("AnomalyC")
temp_var_global_sorted.head(10)
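
One thing worth noticing after sorting: each row keeps its original index label. The sketch below uses a tiny made-up dataframe (`demo`, not the lab data) to show how label-based `.loc` and position-based `.iloc` differ on a sorted dataframe.

```python
import pandas as pd

# A tiny made-up dataframe standing in for the lab data.
demo = pd.DataFrame({'Year': [2000, 2001, 2002],
                     'AnomalyC': [0.3, -0.1, 0.5]})

# sort_values returns a NEW dataframe; demo itself is unchanged.
demo_sorted = demo.sort_values('AnomalyC')
print(demo['Year'].tolist())            # [2000, 2001, 2002]

# The index labels travel with their rows.
print(demo_sorted.index.tolist())       # [1, 0, 2]

# .loc looks up by index label, .iloc by position in the current order.
print(demo_sorted.loc[0, 'Year'])       # 2000
print(demo_sorted['Year'].iloc[0])      # 2001

# reset_index(drop=True) renumbers the rows 0, 1, 2, ... in the new order.
renumbered = demo_sorted.reset_index(drop=True)
print(renumbered.loc[0, 'Year'])        # 2001
```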

In [None]:
# To find all available methods for any object, one can use 
# tab completion on the dot operator...
# Type "temp_var_global." below, and then press tab 
# with your cursor next to the dot. (it may take a few seconds to load)









In [None]:
# To find out what each method or attribute is, type the command followed by "?"
# and run the cell.
temp_var_global.sort_values?


# You can also find most documentation for Python and various packages online.
# E.g. https://pandas.pydata.org/pandas-docs/stable/

### We will pause here to discuss what questions you have on Pandas or Numpy.

'

''

'''

''''

'''

''
 
'

## *Part 3. Basic Plotting*

In [None]:
# Step one, we import a plotting package.
import matplotlib.pyplot as plt

# To enable our visualization within the notebook,
# we use the following command
%matplotlib inline

In [None]:
# Define a figure (think of this as a page).
fig = plt.figure(figsize=(10, 5))

# Let's give it a title.
fig.suptitle('Temperature Variants From 1880 - 2018 Global Averages', fontsize=20)

# Let's just make a line plot and worry about everything else later.
plt.plot(temp_var_global['Year'], temp_var_global['AnomalyF'])

# And label some axes. 
plt.ylabel(r'Anomaly $\degree$F', fontsize = 20)
plt.xlabel('Year', fontsize = 20);

# Notice the ";" at the end of the last line. The semicolon suppresses output. Try running this
# cell after removing the semicolon. What does this imply about what the plt.xlabel() method returns?
# What do you think the plt.plot() method returns? Why might this be useful?


# NOTE: using $COMMANDS$ in a text entry will enable mathematical symbols through LaTeX 


### We will pause here to discuss our opinions on this plot.

'

''

'''

''''

'''

''
 
'

In [None]:
fig = plt.figure(figsize=(10, 5))
fig.suptitle('Temperature Variants From 1880 - 2018 Global Averages', fontsize=20)

plt.plot(temp_var_global['Year'], temp_var_global['AnomalyF'])

plt.ylabel(r'Anomaly $\degree$F', fontsize = 20)
plt.xlabel('Year', fontsize = 20)


#----------------------------------------- we add the following to our code
plt.xticks(fontsize=16) #make the xaxis labels larger
plt.yticks(fontsize=16) #make the yaxis labels larger
plt.axhline(y=0.0, color='k', linestyle='--') #add a horizontal line at 0
plt.grid(color='gray', linestyle='dashed') #add a grid so it's easier to tell if at zero
#-----------------------------------------



### What we lose here is the information that each point has a definite width of 1 year: it is an average over that year. The chaotic behavior of the line is more misleading than informative. A bar chart would fix this issue. Go ahead and try out the next plot block. 

'

''

'''

''''

'''

''
 
'


In [None]:
fig = plt.figure(figsize=(10, 5))
fig.suptitle('Temperature Variants From 1880 - 2018 Global Averages', fontsize=20)

#----------------------------------------- we have edited this into a bar chart

# Note we make it purple for fun ;)
# also the align = 'edge' will align to the left side of the range
# width = 0.8 rather than 1 simply to make it appear more interesting. Go ahead and play 
# with the width to see why it's at 0.8
plt.bar(temp_var_global['Year'], temp_var_global['AnomalyF'], width = 0.8, align='edge', 
        color = 'purple')
#----------------------------------------- 
plt.ylabel(r'Anomaly $\degree$F', fontsize = 20)
plt.xlabel('Year', fontsize = 20)

plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.axhline(y=0.0, color='k', linestyle='--')

plt.grid(color='gray', linestyle='dashed')


## *Part 4 - More advanced plotting*

So while purple was fun, color is not really the main message of this chart. We want to show when the anomaly is greater than zero and when it is less than zero. It would be best if these were different colors.

Remember how we set the color to purple? We can also assign a column JUST for colors in our dataframe so that each bar can have its own color when we plot it. There are two obvious ways to go about this. In both, our goal is to create a NEW column in the dataframe holding the bar colors: blue ('b') where the anomaly is less than zero, red ('r') where it is greater.

#### Come up with an idea of how you would do this. Do not look below to gain inspiration. 
#### Then discuss your ideas with your neighbor. Be ready to share with the class. 



'

''


'''

''''

'''

''

'

The two ways we are going to learn in this laboratory are as follows. 

Way 1)

- The tried and true brute force method of using a for loop to loop over every data value in our array, create a new column, and fill with what color we want it to be. 

Way 2)

- Use the built-in methods of objects (in this case pandas.DataFrame) to avoid writing extensive loop structures


Let's start with way one. 

### *Part 4. A. METHOD ONE  - For Loops.*

In [None]:

# Set up a new column just for colors, and fill it with 'g' for now. We will know if we did 
# something wrong when we plot it because it will be green.
temp_var_global['Colors'] = 'g'

# Now we want to "loop over" our data frame, setting each value in the frame in the column of 
#'Colors' to a set value. 

# What do I mean by a for loop? Here's an example, remember our numpy array from earlier? 
for i in range(len(example_array)):
    print("The value of example_array at index {1} is {0}.".format(example_array[i], i))
    

In [None]:
# What if we want to count each entry? Python has a built-in function called enumerate that
# works on numpy arrays (and other sequences); note it starts counting at zero. 
# The following produces the same information as the previous for loop.

# This will loop over pairs of index AND entry.
for index, entry in enumerate(example_array):
    print("The value of the array at index {} is {}.".format(index, entry))
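
Because `enumerate` is built into Python, it also works on plain lists. A short sketch, using a made-up list `years`, that also shows the optional `start` argument for counting from a number other than zero:

```python
# A made-up list of years (not read from the lab data).
years = [1880, 1881, 1882]

# By default the counter starts at 0.
for position, year in enumerate(years):
    print("Position {} holds year {}.".format(position, year))

# The optional start argument changes where counting begins.
pairs = list(enumerate(years, start=1))
print(pairs)   # [(1, 1880), (2, 1881), (3, 1882)]
```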

### Now let's apply what we learned from the for loops to the dataframe.

In [None]:
# This one is a bit more complicated...but it's the same basic idea, where enumerate for 
# numpy arrays is replaced with iterrows() for pandas dataframes, which does exactly what 
# it sounds like: iterate over rows :) 

for index, row in temp_var_global.iterrows():
    #iterates over the entirety of the dataframe
    print("At index {} and year {} the value of AnomalyF is: {}".format(index, row['Year'], 
                                                                        row['AnomalyF']))
    

In [None]:
# Now let's actually assign colors. We can use if statements here, another flow control structure.

# In words, what this loop means: for every entry in our dataframe, check whether it is > 0 or <= 0.


for index, row in temp_var_global.iterrows():
    
    #check if greater than 0
    if row['AnomalyF'] > 0:
        #set value in array as red
        temp_var_global.at[index, 'Colors'] = 'red'
    
    #check if less than 0, or equal to
    if row['AnomalyF'] <= 0:
        #set value in array as blue
        temp_var_global.at[index, 'Colors'] = 'blue'

# Note: you can find more pre-defined colors in matplotlib here:
# https://matplotlib.org/gallery/color/named_colors.html#sphx-glr-gallery-color-named-colors-py
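
The two `if` checks above cover opposite cases, so the same decision can also be written as a single `if` / `else`. Here is a minimal sketch of that pattern on its own, outside the dataframe; the function `color_for` and the list `sample_anomalies` are made up for this illustration.

```python
def color_for(anomaly):
    """Return 'red' for a positive anomaly and 'blue' otherwise (zero or negative)."""
    if anomaly > 0:
        return 'red'
    else:
        return 'blue'

# Made-up anomaly values: one negative, one exactly zero, one positive.
sample_anomalies = [-0.4, 0.0, 0.7]

sample_colors = []
for value in sample_anomalies:
    sample_colors.append(color_for(value))

print(sample_colors)   # ['blue', 'blue', 'red']
```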

### PAUSE. Think about what is happening in this loop. Explain it to your neighbor. When you are done chatting put your name tag down. 

'

''

'''

''

'

In [None]:
#let's check to see what this looks like, print from index 50 - 70
#print both colors and anomalyF columns

print(temp_var_global.loc[50:70, ['Colors', 'AnomalyF']])

In [None]:
#and now let's plot it!


fig = plt.figure(figsize=(10, 5))
fig.suptitle('Temperature Variants From 1880 - 2018 Global Averages', fontsize=20)

#------------------------------------------------  we edited the following color statement ONLY
plt.bar(temp_var_global['Year'], temp_var_global['AnomalyF'], width = 0.8, align='edge', 
        color = temp_var_global['Colors'])
#------------------------------------------------

plt.ylabel(r'Anomaly $\degree$F', fontsize = 20)
plt.xlabel('Year', fontsize = 20)

plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.axhline(y=0.0, color='k', linestyle='--')

plt.grid(color='gray', linestyle='dashed')

### *Part 4. B. METHOD TWO  - Boolean Indexing.*

In general, Python code "slows down" when it loops over rows one at a time. Using what we call 'Boolean Indexing' avoids
the explicit loop, and it also allows for better programming LATER in class.

We take the same greater-than / less-than logic statements from before and use them to select positions in the dataframe, which we then use to subset our data. For example: all places with >0, and all remaining places (<=0).
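
Before applying this to the dataframe, here is the idea in miniature on a numpy array. The array `values` is made up for illustration.

```python
import numpy as np

# A made-up array of numbers.
values = np.array([-1.5, 0.2, 3.0, -0.1])

# Comparing an array to a number gives one True/False per element.
is_positive = values > 0
print(is_positive)             # [False  True  True False]

# Using the boolean array as an index keeps only the True positions.
print(values[is_positive])     # [0.2 3. ]

# The ~ operator flips every True to False and vice versa.
print(values[~is_positive])    # [-1.5 -0.1]

# True counts as 1, so summing a boolean array counts the True entries.
print(is_positive.sum())       # 2
```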


In [None]:
# Step one: create a boolean array (True/False for each row); we call this a boolean index.
boolean_index = temp_var_global['AnomalyF'] > 0 

print(temp_var_global.loc[0:10, 'AnomalyF'])
print(boolean_index[0:10])

#let's see what these look like!

In [None]:
# The tilde operator (~) flips each value in the boolean index to the opposite truth value.
print(~boolean_index[0:10])

In [None]:
# Now, rather than the for loop we can set up a different system
temp_var_global['Colors2'] = 'k'
temp_var_global.loc[boolean_index, 'Colors2'] = '#9F4E58' 
temp_var_global.loc[~boolean_index, 'Colors2'] = '#64ACEA'
# To customize our colors we use hex codes - check out this website. 
#http://www.color-hex.com/color/90537c
print(temp_var_global.loc[0:10, ['Colors', 'Colors2']])
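
A third option, not used in this lab, is `np.where`, which picks between two values element by element based on a condition. The sketch below uses a small made-up dataframe `demo` and a hypothetical column name `Colors3`.

```python
import numpy as np
import pandas as pd

# A tiny made-up dataframe standing in for the lab data.
demo = pd.DataFrame({'AnomalyF': [-0.5, 0.0, 0.9]})

# np.where(condition, value_if_true, value_if_false) works on the whole column at once.
demo['Colors3'] = np.where(demo['AnomalyF'] > 0, '#9F4E58', '#64ACEA')
print(demo)
```

This gives the same result as the two `.loc` assignments above in a single line, at the cost of being a little less explicit about what happens to each group.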

In [None]:
fig = plt.figure(figsize=(10, 5))
fig.suptitle('Temperature Variants From 1880 - 2018 Global Averages', fontsize=20)

#-------- we edited the following color statement ONLY
plt.bar(temp_var_global['Year'], temp_var_global['AnomalyF'], width = 0.8, align='edge', 
        color = temp_var_global['Colors2'])
#----------

plt.ylabel(r'Anomaly $\degree$F', fontsize = 20)
plt.xlabel('Year', fontsize = 20)

plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.axhline(y=0.0, color='k', linestyle='--')

plt.grid(color='gray', linestyle='dashed')

# How to save figures (note: the ./Figures folder must already exist).
# The first command saves with a transparent background; the second saves with a white one.
plt.savefig('./Figures/TempVariants_GlobalYearlyAverages_Transparent.png', transparent=True)
plt.savefig('./Figures/TempVariants_GlobalYearlyAverages.png')



#please go see within your Jupyter folder the .png file
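
If the output folder does not exist yet, `plt.savefig` raises an error. One way around this is to create the folder from Python first. The sketch below is a minimal example under some assumptions: it writes to a temporary directory that stands in for `./Figures`, and the bar values and the file name `demo.png` are made up.

```python
import os
import tempfile
import matplotlib.pyplot as plt

# Assumption: a temporary directory stands in for the lab's ./Figures folder.
out_dir = os.path.join(tempfile.mkdtemp(), 'Figures')

# Create the folder if it is missing; exist_ok=True avoids an error if it is already there.
os.makedirs(out_dir, exist_ok=True)

# A tiny made-up bar chart.
fig, ax = plt.subplots(figsize=(4, 2))
ax.bar([2000, 2001, 2002], [0.3, -0.1, 0.5])

# dpi sets the resolution; bbox_inches='tight' trims extra whitespace around the plot.
out_path = os.path.join(out_dir, 'demo.png')
fig.savefig(out_path, dpi=150, bbox_inches='tight')
plt.close(fig)

print(os.path.exists(out_path))
```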

### How do the methods we learned compare to each other? The for loop vs. the boolean index method? Specifically, did you like one more than the other?

'

''

'''

''''

'''

''

'

# SUMMARY

From this lab you have learned the basics of three Python packages: numpy, pandas, and matplotlib.pyplot. 

We have also learned about flow control structures (for loops, if statements), how to access a pandas dataframe through both normal and boolean indexing, and how to create and modify its columns. 

Next time we will continue our learning of Python.

Homework: Please complete Assignment 1 located on Canvas, due in 1 week on Wednesday, January 22. Office hours will be held on Friday from 1-3pm in CSRB 2218 (two days from now). If this time doesn't work, please email us and we can arrange separate hours.

'

''

'''

''

'

### Additional Material - Error propagation and reporting

Suppose many people took measurements of the length and width of the CSRB and found:

Length = $112.1 \pm 0.4\ \mathrm{m}$

Width  = $55.5 \pm 0.5\ \mathrm{m}$


How would you report the area, including uncertainty, of the footprint of the building?

We know that Area = Length $\times$ Width $\pm$ $\delta A$, where 

$$\delta A = \sqrt{\left(\frac{\partial A}{\partial w} \delta w \right)^2 +  \left(\frac{\partial A}{\partial l} \delta l \right)^2 }$$

This is equivalent to:

$$ \frac{\delta A}{A} = \sqrt{\left(\frac{\delta w}{w} \right)^2 + \left(\frac{\delta l}{l} \right)^2}$$
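
One way to see the equivalence, writing $A = l\,w$ so that $\frac{\partial A}{\partial w} = l$ and $\frac{\partial A}{\partial l} = w$:

$$\delta A = \sqrt{\left(l\, \delta w \right)^2 + \left(w\, \delta l \right)^2}$$

Dividing both sides by $A = l\,w$ gives

$$\frac{\delta A}{A} = \sqrt{\left(\frac{l\, \delta w}{l\, w} \right)^2 + \left(\frac{w\, \delta l}{l\, w} \right)^2} = \sqrt{\left(\frac{\delta w}{w} \right)^2 + \left(\frac{\delta l}{l} \right)^2}$$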


Let's see how this looks in code!

In [None]:
# Assign values given in problem.
length, width = 112.1, 55.5
dl, dw = 0.4, 0.5

# Calculate Area
area = length * width

# Calculate error propagation; we will import from Python's built-in math module
from math import sqrt

# Using the first equation; note in Python the exponent operator is **
dA = sqrt((length * dw)**2 + (width * dl)**2)

# Let's calculate dA using the second equation
dA_over_A = sqrt((dw / width)**2 + (dl / length)**2)

# And to get dA, we will need to multiply by our calculated Area.
dA_2 = dA_over_A * area

In [None]:
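# Are the two results above equal?
# == is a comparison operator that tests for equality and returns a boolean; use != for not equal.
# Note: floats can differ in their last digits due to rounding, so == may return False here.
dA == dA_2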
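
Since the two results are floats computed along different routes, a more forgiving comparison is often safer than `==`. Here is a sketch using `isclose` from Python's built-in `math` module, which checks whether two numbers agree to within a small tolerance. The values are repeated from the cell above so the block runs on its own.

```python
from math import sqrt, isclose

# Same values as in the problem statement.
length, width = 112.1, 55.5
dl, dw = 0.4, 0.5
area = length * width

# The two routes to the uncertainty in the area.
dA = sqrt((length * dw)**2 + (width * dl)**2)
dA_2 = sqrt((dw / width)**2 + (dl / length)**2) * area

# isclose allows for tiny rounding differences (relative tolerance 1e-09 by default).
print(isclose(dA, dA_2))   # True
```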

We are now ready to report our calculation and uncertainty.

In [None]:
print("Area = {} +/- {}".format(area, dA))

**Don't forget to use significant figures and units!**



In [None]:
print("Area = {0:0.1f} +/- {1:0.1f} m^2".format(area, dA))

You can find more about formatting strings in Python in the documentation.

See Format Specification Mini-Language on the webpage:

https://docs.python.org/3.7/library/string.html
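
The same format specifications also work inside f-strings (available in Python 3.6 and later), where the variable name goes directly inside the braces. A short sketch, using rounded copies of the values computed above:

```python
# Rounded copies of the values computed earlier in this section.
area, dA = 6221.55, 60.286

# .0f means fixed-point notation with no digits after the decimal point.
whole_numbers = f"Area = {area:.0f} +/- {dA:.0f} m^2"
print(whole_numbers)   # Area = 6222 +/- 60 m^2

# .3e means scientific notation with three digits after the decimal point.
scientific = f"{area:.3e}"
print(scientific)      # 6.222e+03
```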

We can also use a markdown cell to explicitly type in our answers, and use $\LaTeX$ to make it look presentable.

**Here is the preferred way to report our calculation**:

$$ A = 6221.5 \pm 60.3 \mathrm{m}^2 $$

For help on Markdown and LaTeX syntax, see:

https://www.markdownguide.org/cheat-sheet/

https://en.wikibooks.org/wiki/LaTeX/Mathematics