# Lab 1 - Part 2: Environmental Data Sets

## *Lab Overview*

This is the second part of Lab 1; its purpose is to show you some ways to import real data sets into Python. Real data sets are often large, come in a variety of formats, and can be finicky to deal with. Therefore, it may take you a few tries to get this working: be creative and persistent!

## *Learning Goals*
After this lab you should be able to:
- Load a variety of real data sets into Python
- Create and use a mask to remove missing values from data
- Create a vector of datenumbers using a for-loop
- Generate x-y plots of loaded data with appropriate axes and figure captions

## *To Hand In*

1. A plot of the Mauna Loa CO2 time series.
2. A plot of the MEI index.
3. A plot of the data you downloaded for the pre-lab.
4. A figure caption for the plot of your data set.

## 1. Mauna Loa CO2 time series

First you will make your own plot of the Mauna Loa data which you saw on the first day of class. 

Before you begin, remember to create a new sub-directory (folder) in your Z: drive where you will be able to save all files from this lab session. Ensure your notebook is saved in, and running from, this directory. **will need to see how we do this for notebooks**


Download the file `monthly_maunaloa_co2.csv` from the class website on Connect. This contains atmospheric CO2 data recorded at the Mauna Loa Observatory. Save the file in the correct directory.

Now open the file in a text editor (or in Excel) **original doc has a note on this to use vim?** and observe how the data is organized. Notice the detailed header; we’ll have to tell Python to ignore the header, but you should at least scan it so you know what the data represents. How many header lines are there? What do the columns represent?

First, we are going to load the data from the file into Python with the help of the pandas library. Start off by running the cell below to import the library:

In [None]:
#library to help us manipulate data
import pandas as pd

We will then read and save the data using the `pandas.read_csv()` function. We will use the following input arguments with this function:

- the file name
- the row number where the column headers are located in the CSV file (note that Python is zero-indexed, so we enter the row number from the CSV file minus 1)
- the rows we don’t want to save as column headers or data points

Complete the code below and run the cell:

In [None]:
muana_loa_data = pd.read_csv(r"monthly_maunaloa_co2.csv", header= [54], skiprows = [55,56])

This snippet of code will create the DataFrame `muana_loa_data` containing ten columns, one for each column in the CSV file.

DataFrames are data structures in Python that resemble spreadsheets. Each row of the DataFrame is automatically indexed starting at 0, and each column adopts the header name from the original CSV file. Each cell contains the data from the CSV file in the form of an object as interpreted by Python.
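To see how the `header` and `skiprows` arguments interact, here is a minimal sketch on a tiny made-up file held in memory. The text in `raw` is invented for illustration and is much shorter than the real Mauna Loa file, so the row numbers differ from the ones used above.

```python
import io
import pandas as pd

# A made-up miniature file: two comment lines, a header row,
# a units row we do not want, and two rows of data.
raw = (
    "Comment line 1\n"
    "Comment line 2\n"
    "Yr, Mn, CO2\n"
    "  ,   , [ppm]\n"
    "1958, 3, 315.71\n"
    "1958, 4, 317.45\n"
)

# header=2   -> the column names are on the third line (zero-indexed row 2)
# skiprows=[3] -> drop the units row so it is not read as data
demo = pd.read_csv(io.StringIO(raw), header=2, skiprows=[3])

print(demo)
print(list(demo.columns))  # note the leading spaces kept in the names
```

Everything above the header row is discarded, and the skipped units row never reaches the DataFrame.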

It’s a good habit to make sure everything was parsed correctly. There are several functions that can help us when dealing with DataFrames. Let’s take a look:

In [None]:
#we can print the whole dataset; notice how it resembles a spreadsheet
print(muana_loa_data)

Sometimes we don’t want to print the whole dataset, especially when dealing with large datasets. We can instead do the following:

In [None]:
#If we put any number n as an input, we can see the first n rows of our data set. 
print(muana_loa_data.head(3))

In [None]:
#if we don't give the function an input, we get the first five rows back by default.
print(muana_loa_data.head())

In [None]:
#we can use the .head function with input 0 to see the column names
print(muana_loa_data.head(0))

Let’s also take a look at how Python interpreted our data; this may be useful later on:

In [None]:
print(muana_loa_data.dtypes) 

You might notice that our column names have inconsistent spacing. This is because the pandas library reads the CSV file literally, whitespace included. We might want to get rid of this whitespace to make it easier to access the data later on. We can use the replace function to do so:

In [None]:
#replacing all spaces with an empty string
muana_loa_data.columns = muana_loa_data.columns.str.replace(' ','')

#let's see how our column names look now
print(muana_loa_data.head(0))
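The same clean-up can be tried on a small made-up DataFrame first. The column names and values below are invented for illustration; only the `.columns.str.replace` call mirrors the cell above.

```python
import pandas as pd

# A made-up DataFrame whose column names carry stray spaces
demo = pd.DataFrame([[1958.2027, 315.71]], columns=["   Date ", " CO2"])

# Replace every space in every column name with an empty string
demo.columns = demo.columns.str.replace(" ", "")

cleaned = list(demo.columns)
print(cleaned)
```

After the clean-up the columns can be accessed with the short names, for example `demo["CO2"]`.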

Now that we have made our DataFrame a bit easier to use, we can plot the data with the help of the matplotlib library. Start off by importing the pyplot module:

In [None]:
import matplotlib.pyplot as plt

We can now save the values we want to plot. We are going to take the contents of the columns we are interested in and assign these to their own variables. We want columns 4 and 5, which contain the date and the measured CO2. If we look at our outputs from above, we can see the respective names are "Date" and "CO2".

In [None]:
co2_date = muana_loa_data["Date"].values
co2 = muana_loa_data['CO2'].values

#let's take a look at our variables
print(co2_date)
print(co2)

You may notice that our co2 and co2_date variables each contain two columns. This is because the DataFrame has two different columns with the same name. Let’s fix this and keep only the columns we want: **this is an issue with the data**

In [None]:
co2_date = co2_date[:,1]
co2=co2[:,0]
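The effect of duplicate column names can be reproduced on a small made-up DataFrame. The numbers below are invented for illustration; the point is only that selecting a repeated name returns every matching column, so `.values` is a two-dimensional array that we then slice.

```python
import pandas as pd

# A made-up DataFrame with two columns that share the name "Date"
demo = pd.DataFrame(
    [[21259, 1958.2027, 315.71],
     [21290, 1958.2877, 317.45]],
    columns=["Date", "Date", "CO2"],
)

both_dates = demo["Date"].values   # 2-D array: one column per "Date"
print(both_dates.shape)

# [:, 1] means "all rows, second column"
decimal_year = both_dates[:, 1]
print(decimal_year)
```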

Let’s do all this again for the seasonally adjusted data: **lab doesnt ask for this but TA code does do this**

In [None]:
co2sa = muana_loa_data["seasonally"].values
co2sa = co2sa[:,0] # https://stackoverflow.com/questions/40557910/plt-plot-meaning-of-0-and-1 explanation of this notation, good to use in lab 1a
co2fit = muana_loa_data["fit"].values
co2safit = muana_loa_data["seasonally"].values
co2safit = co2safit[:,1]

Now that we’ve successfully wrangled all our data, we can finally plot it:

In [None]:
#the first input of our .plot function will be our x axis, the second input will be our y axis
plt.plot(co2_date,co2) #plotting data
plt.show()

As you can see, pyplot automatically adds x and y ticks. However, we might want to add labels for our axes or a title, or change the color of the plot:

In [None]:
# the 'r' here specifies we want our plot line to be red
plt.plot(co2_date,co2,'r')

In [None]:
#you can have a lot more fun with styling, some examples:
'b'    # blue markers with default shape
'or'   # red circles
'-g'   # green solid line
'--'   # dashed line with default color
'^k:'  # black triangle_up markers connected by a dotted line

# take a look at the function documentation for more styling options:
# https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html
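To see a few of these format strings in action, here is a small sketch on made-up data. The x and y values are invented for illustration and are not taken from the Mauna Loa file.

```python
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]

# 'or'  -> red circles, no connecting line
(circles,) = plt.plot(x, [0, 1, 4, 9], 'or')
# '--g' -> green dashed line
(dashed,) = plt.plot(x, [0, 2, 4, 6], '--g')
# '^k:' -> black triangle_up markers connected by a dotted line
(triangles,) = plt.plot(x, [9, 4, 1, 0], '^k:')

plt.show()
```

Each call to `plt.plot` returns a list of line objects, which is why the results are unpacked with `(name,) = ...`.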

Let’s add our title and axis labels: **note plt.show() is not necessary using jupyter. will probably remove, but good to discuss**

In [None]:
plt.plot(co2_date,co2,'r')
#setting the title
plt.title('Raw CO2 Data vs. Time with missing data')
#setting the x axis label
plt.xlabel('Date')
#setting the y axis label
plt.ylabel('CO2 (ppm)')
plt.show() 

Notice the negative values in the time series. These are missing values. Go back to the header, determine how missing values are indicated, and remove these from the data set using the masking technique you learned in Part 1 of this lab. Then make a new plot, adding the correct labels on the x- and y-axes.

Save the plot to a PDF or JPG file. Please print and submit this figure.
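One way to save a figure from code is `plt.savefig`, sketched below on made-up values. The file name `co2_demo.pdf` and the temporary folder are placeholders chosen for this example; in the lab you would save into your own working directory.

```python
import os
import tempfile
import matplotlib.pyplot as plt

# Made-up values, just to have something to draw
plt.plot([1958, 1959, 1960], [315.0, 316.0, 317.0], 'r')
plt.xlabel('Date')
plt.ylabel('CO2 (ppm)')

# Placeholder output location; the extension (.pdf, .jpg, .png) picks the format
out_path = os.path.join(tempfile.gettempdir(), 'co2_demo.pdf')

# Call savefig before plt.show(), otherwise the saved figure may be empty
plt.savefig(out_path)
plt.show()
```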

In [None]:
#import the masking library
import numpy.ma as ma
#importing numpy
import numpy as np

**masking is a bit weird here. my IDE was different, will go over it again.
It seems like the way jupyter is masking the data is different than my IDE/sources online (my IDE masks the data, here it seems data is being removed or maybe plot is avoiding the masking)**

In [None]:
co2 = ma.masked_where(co2<0,co2) #mask the negative CO2 values, which flag missing data
print(co2) #masked entries are printed as --
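To make it clearer what masking does, here is a minimal sketch on a short made-up array. The values, including the negative placeholder used for missing data, are invented for illustration; check the file header for the real missing-value flag.

```python
import numpy as np
import numpy.ma as ma

# Made-up CO2 values where a negative number stands in for "missing"
demo_co2 = np.array([315.71, -99.99, 317.45, -99.99, 318.29])

# Mask every entry where the condition is True
masked = ma.masked_where(demo_co2 < 0, demo_co2)

print(masked)        # masked entries are displayed as --
print(masked.mask)   # True marks a hidden entry

n_masked = ma.count_masked(masked)   # how many entries are hidden
valid = masked.compressed()          # only the unmasked values
print(n_masked, valid)
```

The masked values are still stored in the array; they are hidden from printing, plotting, and statistics rather than deleted, which is why the output can look as though they were removed.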

In [None]:
#masking empty entries
co2 = ma.masked_where(co2_date[:631]<0,co2) #masking data
co2_date = ma.masked_where(co2<0,co2_date[:631]) #TA solution doesn't mask the date data. May remove.
co2 = ma.masked_where(co2<0,co2) #masking data
co2_date = ma.masked_where(co2_date<0,co2_date) #TA solution doesn't mask the date data. May remove.

#Plotting the raw CO2 data with missing values masked
plt.plot(co2_date,co2,'r-',label='CO2 Raw') #plotting data
plt.title('Raw CO2 Data vs. Time')
plt.legend()
plt.ylabel('CO2 (ppm)')
plt.xlabel('Date')
plt.show()

**TA code also plots fitted data on top**

In [None]:
#masking empty entries
co2fit = ma.masked_where(co2_date<0,co2fit[:631])
co2fit = ma.masked_where(co2<0,co2fit)
co2fit = ma.masked_where(co2fit<0,co2fit)
co2 = ma.masked_where(co2_date[:631]<0,co2) #masking data
co2_date = ma.masked_where(co2<0,co2_date[:631]) #TA solution doesn't mask the date data. May remove.
co2 = ma.masked_where(co2<0,co2) #masking data
co2_date = ma.masked_where(co2_date<0,co2_date) #TA solution doesn't mask the date data. May remove.


#Plotting all raw CO2 data with fitted curve overplotted
plt.plot(co2_date,co2,'r-',label='CO2 Raw') #plotting data
plt.plot(co2_date,co2fit,'k-',label='CO2 Fit')
plt.title('Raw CO2 Data vs. Time')
plt.legend()
plt.ylabel('CO2 (ppm)')
plt.xlabel('Date')
plt.show()

**TA code also plots seasonally adjusted data**

In [None]:
#masking data
co2sa = ma.masked_where(co2sa<0,co2sa)
co2safit = ma.masked_where(co2safit<0,co2safit)

#plotting seasonally adjusted CO2 data with fitted curve overplotted
plt.plot(co2_date,co2sa[:631],'g-',label='co2sa') #plotting data
plt.plot(co2_date,co2safit[:631],'k-',label='co2sa fit')
plt.legend()
plt.title('Seasonally Adjusted CO2 Data vs. Time')
plt.ylabel('CO2 SA fit (ppm)')
plt.xlabel('Date')
plt.show()

## 2. MEI Index
Now we are going to import and plot a time series of the Multivariate ENSO Index (MEI), a climate index which represents the state of the El Niño-Southern Oscillation. The MEI gives an indication of whether the tropical Pacific is in an El Niño (warm water, positive MEI) or a La Niña (cold water, negative MEI) state.

From the class website, download the file MEI.html and place it in your working directory. As before, open the file in a text editor and observe how the raw data is organized. This is always your first step.

Notice that there are 13 header lines and 63 rows of data, followed by some useful tips at the end of the file. Each row represents a year and each column a month within that year.

**Original: Though we could again use textscan() to upload the data, we will use a different approach this time. In the future, feel free to use whichever suits you better.**

**should I also try using a different method above**

Begin by making a copy of the file and renaming it something like MEI_data_only.txt. Now, in this new file, simply delete all the header lines as well as the trailing information, leaving you with only the 63 lines of data. Make sure to save.
Notice that the last line of data doesn’t contain the same number of columns as all the others. This is because we don’t have data from Aug-Dec 2012.

**Original: However, when we use the load command (below) missing data cannot be indicated with blank spaces because, as we’ve seen before, MATLAB ignores white space. Therefore, manually enter the five missing data points as NaN which simply stands for ‘not-a-number’. The final line in your data file should now look like this:
2012 -1.046 -.702 -.41 .059 .706 .903 1.139 NaN NaN NaN NaN NaN ** 
**I believe this is not really applicable here**

Don’t worry if the numbers aren’t evenly spaced or if we have missing values; pandas will take care of everything for us behind the scenes.

Now, with a text file containing only data and no extra information, we can load that data using the read_csv function from before, but with slightly different inputs:

In [None]:
#We add sep='\t' to specify that our data is tab-separated

#We add index_col=0 so the rows are labelled by the first (0th) column of the file, the year.
#If we don't do this, Python will automatically number the rows from 0 to n
#and the year column will be treated as data.

#We add header=None so the first row of data won't be interpreted as the column names

M = pd.read_csv('MEI_data_only.txt', sep='\t',index_col=0,header=None)

With this, M is a DataFrame containing all the data from the text file. Try printing it to see how it is organized.

In [None]:
print(M)
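To check what these three arguments do, the same call can be tried on a tiny made-up file held in memory. The three rows in `raw` are invented for illustration and assume tab-separated values, with the last row deliberately shorter than the others.

```python
import io
import pandas as pd

# Made-up tab-separated data: a year followed by three values per row.
# The last row is missing its final value on purpose.
raw = (
    "2010\t1.0\t1.5\t1.3\n"
    "2011\t-1.7\t-1.6\t-1.5\n"
    "2012\t-1.0\t-0.7\n"
)

demo = pd.read_csv(io.StringIO(raw), sep='\t', index_col=0, header=None)

print(demo)               # rows are labelled by year
print(demo.isna().sum())  # the short row was padded with NaN
```

The missing trailing value shows up as NaN ("not a number"), so there is no need to type it into the file by hand.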

With this complete, there are two challenges left before we can plot this data. First, we need to reshape the DataFrame into a one-dimensional array (call this mei) so that it can be plotted. Second, since our data only has year information but no month information, we need to create an appropriate array (called dates) the same size as mei to represent the date, both year and month. The first step is quite easy, as Python has built-in functions to reshape DataFrames:

In [None]:
#reshaping array
mei=M.to_numpy().flatten()

#taking a look at the result
print(mei)
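The order in which `flatten` unrolls the table matters here, so it is worth checking on a small made-up example. The DataFrame below is invented for illustration, with two "years" of three "months" each.

```python
import pandas as pd

# Made-up table: one row per year, one column per month
demo = pd.DataFrame([[1, 2, 3],
                     [4, 5, 6]], index=[1950, 1951])

# flatten() reads the table row by row, left to right
flat = demo.to_numpy().flatten()
print(flat)
```

Because the rows are read in order, the flattened array runs through every month of the first year before moving on to the next year, which is the order we need for a time series.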

Now that you have a single one-dimensional array containing your data points, you need to make an array holding the corresponding dates. To do this, you can use the method outlined below. But first, a bit about dates in Python.

### Dates in Python
In order to make a proper time series in Python, you have to understand a little about how the program stores and displays dates. Python has a module called datetime which allows you to create complex data structures that represent dates. First, let’s import the module:

In [None]:
import datetime 

Now let’s get to know the module:

In [None]:
#This call returns today’s date.
print("today's date: " + str(datetime.date.today()))

#We can also get the time right now:
print("the time right now: " + str(datetime.datetime.now()))

#And finally, we can also create our own datetime object:
x=datetime.datetime(2022,2,2)
print("the date we created is "+ str(x))

#Creating such objects gives us some useful functionality with our date data; for example, we can change dates:
x = x.replace(year=2019)
print("and now it is " + str(x))

Having this sort of functionality is especially important when we want to create a time series. You will see why in following labs.
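As a small taste of that functionality, the sketch below does some arithmetic with datetime objects. The two dates are the ones used in the cell above; the variable names are made up for this example.

```python
import datetime

start = datetime.datetime(2019, 2, 2)
end = datetime.datetime(2022, 2, 2)

# Subtracting two datetimes gives a timedelta
elapsed = end - start
days_between = elapsed.days
print(days_between)

# Adding a timedelta shifts a date
one_week_later = end + datetime.timedelta(days=7)
print(one_week_later)

# strftime formats a date as text, here as year-month
label = end.strftime("%Y-%m")
print(label)
```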

Let’s go ahead and create a list that holds the dates on which the data was collected:

In [None]:
dates =[]
for i in range(1950,2013):
    for j in range(1,13):
        dates.append(datetime.datetime(i, j, 1))
        
#taking a peek at the data
print(dates[:5])
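As a cross-check on the loop, the same monthly dates can also be generated with `pd.date_range`. This is an alternative sketch, not the method the lab asks for; the frequency string `"MS"` means "month start".

```python
import datetime
import pandas as pd

# The nested loop from above, written as a list comprehension
loop_dates = [datetime.datetime(year, month, 1)
              for year in range(1950, 2013)
              for month in range(1, 13)]

# One date at the start of every month from Jan 1950 to Dec 2012
range_dates = pd.date_range(start="1950-01-01", end="2012-12-01", freq="MS")

# Both approaches should give the same 63 * 12 dates
n_dates = len(loop_dates)
same = list(range_dates.to_pydatetime()) == loop_dates
print(n_dates, same)
```

The number of dates should match the length of mei; if it does not, the plot call will raise an error about mismatched dimensions.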

Once we’ve done this, we can make a plot of the MEI index (remember to label your axes) and submit this figure:

In [None]:
plt.plot(dates,mei)
plt.title("Monthly MEI index from 1950-2012")
plt.ylabel("MEI index")
plt.xlabel("Date")
plt.show()

## 3. Import and Plot Your Data Set
Import the data set you downloaded in the pre-lab and generate a plot from it.

Since each of the data sets is organized differently, it is up to you to determine how to properly import it. However, feel free to show your data set to a lab TA and discuss challenges associated with importing the different data sets; some data sets are much easier to work with than others.

Write a Python script or notebook which will do the following:
1. Load the data set you selected in the pre-lab assignment.
2. Assign the data to appropriate Python variables.
3. Generate a plot of your data with any missing values removed. Give your plot an appropriate title and label your axes, including the correct units. If your data file contains more than one kind of measurement, choose only one variable to plot.

Once the plot is completed, write a caption for the plot. The caption should include:
- A title for the graph
- A description of what data the plot is showing
- Axes labels including the units of the data
- Any information necessary for interpreting the plot correctly.

Captions should be between 15 and 100 words long; the more complicated a plot is, the more description will be needed in the caption.

For example, a figure caption for the first plot we created might look like this:

Figure 1: Atmospheric CO2. CO2 concentrations in parts per million (ppm) recorded at the Mauna Loa Observatory from 1958 - 2011.
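For the import step of this section, one possible starting point is sketched below. Everything in it is hypothetical: the column names, the values, and the `-999` missing-value flag are invented, and your own data set will almost certainly need different arguments.

```python
import io
import pandas as pd
import matplotlib.pyplot as plt

# A made-up comma-separated file where -999 marks a missing measurement
raw = "year,temp_C\n2000,14.2\n2001,-999\n2002,14.6\n"

# na_values tells pandas which entries to treat as missing (NaN)
df = pd.read_csv(io.StringIO(raw), na_values=["-999"])

# dropna removes every row that contains a missing value
clean = df.dropna()

plt.plot(clean["year"], clean["temp_C"], 'o-')
plt.title("Example temperature record")
plt.xlabel("Year")
plt.ylabel("Temperature (degrees C)")
plt.show()
```

Masking, as used for the Mauna Loa data, would work equally well here; `na_values` is handy when the file header tells you up front which value marks missing data.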