# Tutorial 3: Introduction to Plotting with Matplotlib

In the previous module, we learned about data quality control, selection, and manipulation. An important step in each of these tasks is plotting. Sometimes, we want to plot the data before we've done quality control. This can help us see what might be missing from the data or what might be an outlier. Data visualization is also very helpful for checking that we correctly selected or manipulated our data. Once we have our results, we then need to plot the data so that others can see what we've done! For these reasons, and many more, being confident in your ability to plot data is very important in scientific research.

In this tutorial, we will introduce Matplotlib.pyplot and learn how to make line plots from pandas data. By the end of this tutorial, you will be able to:
* identify the different components of code necessary for plotting
* implement changes to existing code to adapt a plot
* analyze error messages and fix code to resolve errors with plotting

### What is Matplotlib?

Matplotlib is a Python package for plotting and visualization of data. It is not specific to scientific data and plotting, so there are some very cool visualization tools to explore if you want. Here is a list of the plotting techniques available in Matplotlib: https://matplotlib.org/stable/tutorials/introductory/sample_plots.html#sphx-glr-tutorials-introductory-sample-plots-py

Within Matplotlib, there are several modules that do different things. *pyplot* does the basic plotting functions, and that is the module we'll be working with in this tutorial. There are other modules for more complex plotting tools. For example, *colors* handles colormaps and the color-coding of your figures. You can add a colorbar using pyplot alone, but the other modules give you finer control when you need to adapt it to suit your needs. Modules like *patches* and *lines* are useful for adding features on top of a figure.

For this tutorial, we will be working with pandas for data manipulation and Matplotlib.pyplot for data visualization.

### Data Preparation

In [None]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np # just in case
import calendar # a built-in python package with all the basic calendar and date information

In [None]:
# we will work with the full dataset from tutorial 2. This is the same data, but combined into one file.
# we can do the same quality control and manipulation we tested out last tutorial.
df = pd.read_csv('truax_2019.csv')

In [None]:
# remember to look at your data first!
df.head(10)

Let's turn the separate year, month, day, hour columns into a single datetime. This will become our x-data. 

We can use the loc command like we did in the previous tutorial, and this time we will specify both index and column conditions in .loc[index,column]. Alternatively, we could use the iloc command, which works slightly differently: loc is label-based (it selects by index and column names), while iloc is integer-based (it selects by position, counting from 0). It is easier to use loc when you know the name of the column you want; when you do not know the name, you can use iloc instead. To see how they are used differently, check out the code cell below.

In [None]:
# first, use the loc command to select the columns with date and time. We will use : to specify all indexes
# we then select all columns between year and hour using another : and the column names
date_sep_loc = df.loc[:,'year':'hour']

# now the same thing with iloc
date_sep_iloc = df.iloc[:,0:4]

print(date_sep_loc.head(10))
print(date_sep_iloc.head(10))

They look the same to me!
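Why do they match even though one slice ends at 'hour' and the other ends at 4? Here is a minimal sketch on a made-up table (the name `toy` and its values are invented for illustration, not taken from truax_2019.csv). A label slice with loc includes its end label, while a position slice with iloc stops one before its end position.

```python
import pandas as pd

# a tiny, made-up table that mimics the first columns of the tutorial data
toy = pd.DataFrame({
    'year': [2019, 2019],
    'month': [1, 1],
    'day': [1, 1],
    'hour': [0, 1],
    'airtemp_degc': [-3.2, -3.5],
})

# loc slices by label and INCLUDES the end label 'hour'
by_label = toy.loc[:, 'year':'hour']

# iloc slices by position and EXCLUDES the end position 4,
# so 0:4 selects the columns at positions 0, 1, 2, and 3
by_position = toy.iloc[:, 0:4]

print(list(by_label.columns))
print(by_label.equals(by_position))
```

If the date columns were not the first four columns of the file, only the loc version would still select the right ones.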

In [None]:
# next, turn these separate column values into a pandas datetime object
datetime = pd.to_datetime(date_sep_loc)

# finally, add this back to the original DataFrame
df['datetime'] = datetime
df.head(10)
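To see what pd.to_datetime does with separate date columns, here is a small sketch on invented values (the name `parts` is hypothetical). It assumes the columns are named year, month, day, and hour, as they are in this tutorial's data.

```python
import pandas as pd

# made-up date pieces, one row per observation
parts = pd.DataFrame({
    'year': [2019, 2019],
    'month': [1, 7],
    'day': [1, 4],
    'hour': [0, 15],
})

# pandas assembles one timestamp per row from the named columns
stamps = pd.to_datetime(parts)
print(stamps)
```

The result is a pandas Series of timestamps with the same length as the input, which is why it can be added back to the DataFrame as a new column.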

### Simple Line Plots

In [None]:
# now let's choose a variable to plot
var = 'airtemp_degc' # change the variable here if you want!

In [None]:
# now we should do our quality control by selecting where our variable is not NaN
df_var1 = df.dropna(axis='index',subset=[var])
df_var1

**Knowledge Check** Why did we create a new DataFrame when removing the NaNs? We could have used inplace=True but didn't. What benefit is there to creating a new DataFrame instead of modifying the original one?

In [None]:
# let's make a line plot of the data to see how it looks
ydata = df_var1[var]
xdata = df_var1['datetime']
plt.plot(xdata,ydata)                      # this makes a line plot of xdata versus ydata
plt.show()

How did your plot turn out? Based on what you know of the variable you chose, does the line plot make sense?

**Knowledge Check** Go back and select a different variable. Repeat the quality control and plotting steps. Does the line plot still make sense? Do you need to do any more quality control?

That plot we made is rather bland. As a check on quality control, it's probably fine. But what if we wanted to present this data or use it in a poster? What do we need to do to make it look fancy?

In [None]:
plt.plot(xdata,ydata)                      # this makes a line plot of xdata versus ydata
plt.xlabel('Date')                         # this adds a label to your x-axis
plt.ylabel('Temperature (˚C)')             # y-axis label, change if you selected a different variable
plt.title('Hourly Temperature in 2019')    # the title, change if you selected a different variable
plt.show()

Ok, so that looks a little better. But the size of the figure is really small. And the font size is also small, which can make it hard to see things on a poster or in a presentation.

In [None]:
fig = plt.figure(figsize=(10,8)) # before making the plot, we specify the (width,height) of the whole figure
plt.plot(xdata,ydata)                      
plt.xlabel('Date', fontsize=14)                         # we specify the fontsize as 14 for the axes labels
plt.ylabel('Temperature (˚C)', fontsize=14)
plt.xticks(fontsize=12)                                 # the tick markers can also have a set fontsize
plt.yticks(fontsize=12)
plt.title('Hourly Temperature in 2019', fontsize=18)    # the title is now size 18 font
plt.show()

Better. This is a figure we could use in a presentation for group meeting. Let's save it so we can present it later.

In [None]:
fig = plt.figure(figsize=(10,8)) 
plt.plot(xdata,ydata)                      
plt.xlabel('Date', fontsize=14)                         
plt.ylabel('Temperature (˚C)', fontsize=14)
plt.xticks(fontsize=12)                                 
plt.yticks(fontsize=12)
plt.title('Hourly Temperature in 2019', fontsize=18)
plt.savefig('mylineplot.png')  # instead of plt.show(), we used the savefig('filename.filetype') command
# you can also save the figure to a different directory by changing the path in savefig('path/filename.filetype')
# Note: savefig command MUST come before show() if you write in both. Otherwise, your saved fig will be blank
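For a poster, you may want a sharper image than the default. The sketch below uses made-up data and a hypothetical file name, and saves into the system's temporary folder so it does not clutter your working directory. The `dpi` and `bbox_inches` arguments are optional savefig settings; the values shown are only a suggestion. The `matplotlib.use('Agg')` line is there so the sketch also runs as a plain script without a display, and you can leave it out inside a notebook.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')   # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 8))
plt.plot([0, 1, 2], [1.0, 3.0, 2.0])          # placeholder data, not the Truax record
plt.title('Placeholder data', fontsize=18)

# hypothetical output location: the system's temporary folder
out_path = os.path.join(tempfile.gettempdir(), 'mylineplot_highres.png')

# dpi sets the resolution; bbox_inches='tight' trims extra white space
plt.savefig(out_path, dpi=300, bbox_inches='tight')
plt.close(fig)

print(os.path.exists(out_path))
```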

### Plotting Multiple Lines

What if we want to compare two variables in the same plot? That's easy! Just plot two lines!

In [None]:
# let's focus on temperature for now. If you changed the variables above, this code will reset everything.
var = 'airtemp_degc'
df_var1 = df.dropna(axis='index',subset=[var])

# select a second variable
var2 = 'dewpoint_degc'
# quick quality control
df_var2 = df_var1.dropna(axis='index',subset=[var2])    # pay attention to which DataFrame we used!

In [None]:
# variables for plotting
xdata = df_var2['datetime']
ydata1 = df_var2[var]
ydata2 = df_var2[var2]

In [None]:
fig = plt.figure(figsize=(10,8)) 
plt.plot(xdata,ydata1,color='red',label='Temperature (˚C)')         # plot the temperature data in red
plt.plot(xdata,ydata2,color='blue',label='Dewpoint (˚C)')           # plot the dewpoint data in blue
plt.xlabel('Date', fontsize=14)                         
plt.ylabel('Temperature (˚C)', fontsize=14)
plt.xticks(fontsize=12)                                 
plt.yticks(fontsize=12)
plt.title('Hourly Temperature and Dewpoint in 2019', fontsize=18)
plt.legend(fontsize=12)                                             # put a legend in the figure
plt.show()

Nice! This is exciting, we're making presentation-quality plots now! 
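By now we have typed the same figure size and font sizes in several cells. One possible shortcut, shown here only as an optional sketch, is to collect those settings in a dictionary and apply them with `plt.rc_context`. The dictionary name `poster_style` and the plotted values are made up; the keys are Matplotlib rcParams names.

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# the same sizes used in the cells above, collected in one place
poster_style = {
    'figure.figsize': (10, 8),
    'axes.labelsize': 14,
    'axes.titlesize': 18,
    'xtick.labelsize': 12,
    'ytick.labelsize': 12,
}

# the settings only apply to figures created inside the with-block
with plt.rc_context(poster_style):
    fig = plt.figure()
    plt.plot([0, 1, 2], [1.0, 3.0, 2.0])      # placeholder data
    plt.xlabel('Date')
    plt.ylabel('Temperature (˚C)')
    plt.title('Placeholder title')
    title_size = plt.gca().title.get_fontsize()
    width, height = fig.get_size_inches()

plt.close(fig)
print(title_size, width, height)
```

Typing the fontsize arguments out, as we did above, is just as valid and makes each setting easier to spot while you are learning.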

Now that you've made high-quality line plots, let's try making a different type of plot. Based on the data that we have (hourly observations), let's try scatter plots.

In [None]:
# select the data, remove NaNs, and get the x and y data
var = 'liqprec_mm'
df_var = df.dropna(axis='index',subset=[var])
ydata = df_var[var]
xdata = df_var['datetime']

# make a scatter plot. What do you notice is different about this code compared to plt.plot?
fig = plt.figure(figsize=(10,8)) 
plt.scatter(xdata,ydata,s=10,color='blue',label='Precipitation (mm)')
plt.xlabel('Date', fontsize=14)                         
plt.ylabel('Accumulated Precipitation (mm)', fontsize=14)
plt.xticks(fontsize=12)                                 
plt.yticks(fontsize=12)
plt.title('Amount of Precipitation in 2019', fontsize=18)
plt.legend()
plt.show()

The scatter command still needs x and y data like plt.plot, but you can also change the size and shape of the dots (did you notice the s=10 argument?). Check out this link: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html and this link: https://matplotlib.org/stable/api/markers_api.html. Now try to change the size and shape (and colors!) of the dots on your own!
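As a starting point for that experiment, here is a small sketch with invented numbers. It changes the marker shape (`marker`), the size (`s`), the color, and the transparency (`alpha`); the specific values are arbitrary choices, not recommendations.

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# made-up values standing in for xdata and ydata
x = [1, 2, 3, 4]
y = [0.2, 1.5, 0.7, 3.1]

fig = plt.figure(figsize=(10, 8))
# triangles instead of dots, larger size, and partly transparent
points = plt.scatter(x, y, s=80, marker='^', color='green', alpha=0.6,
                     label='Made-up values')
plt.legend()
plt.close(fig)

print(len(points.get_offsets()))   # number of points that were drawn
```

Partly transparent markers can make overlapping points easier to tell apart.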

One thing you may have noticed with the scatter plot is that it looks bulky and busy around 0 mm. That is because there are a bunch of dots all overlapping. We can't really see any dots at 0.1 mm because all the 0.0 mm dots cover that information up. Let's do some quality control to remove those 0 mm dots!

In [None]:
# let's try the loc command to find precip days greater than zero first 
ydata = df_var.loc[df_var[var]>0.0]
ydata

In [None]:
# make a scatter plot without 0 mm precip
fig = plt.figure(figsize=(10,8)) 
plt.scatter(xdata,ydata,s=10,color='blue',label='Precipitation (mm)')
plt.xlabel('Date', fontsize=14)                         
plt.ylabel('Accumulated Precipitation (mm)', fontsize=14)
plt.xticks(fontsize=12)                                 
plt.yticks(fontsize=12)
plt.title('Amount of Precipitation in 2019', fontsize=18)
plt.legend()
plt.show()

What is that error telling us? At the bottom it says "<font color=red>ValueError</font>: x and y must be the same size." And at the top, there is a green arrow pointing at the plt.scatter line of code. This means that in the scatter code, the xdata and the ydata are not the same length. Why not?

**Knowledge Check** Why do you think the xdata and ydata are not the same length? They were the same length when we did the first scatter plot. How did we modify the data? Come up with a hypothesis before moving on to the next cell.

In [None]:
# use the .where command to replace values of precip that are not greater than 0
# the format of this command is .where(condition,replace_value). If condition is false, the replace_value is used
ydata = df_var[var].where(df_var[var]>0.0,np.nan)

# now let's check lengths of data!
print(len(ydata))
print(len(xdata))

What did we do differently? Well, the first time, we **dropped** all the 0.0 mm rows from the DataFrame. This meant that the DataFrame got shorter, so xdata and ydata had different lengths. In the second instance, we **replaced** all 0.0 mm values with np.nan. This preserved the length of the data.
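The same difference can be seen on a toy example. The values and names below are invented, and the sketch uses a single pandas Series instead of the full DataFrame to keep it short.

```python
import numpy as np
import pandas as pd

# four made-up hourly precipitation values and their timestamps
precip = pd.Series([0.0, 1.2, 0.0, 0.4], name='liqprec_mm')
times = pd.to_datetime(['2019-07-01 00:00', '2019-07-01 01:00',
                        '2019-07-01 02:00', '2019-07-01 03:00'])

# selecting with a condition DROPS the rows that fail it
dropped = precip.loc[precip > 0.0]

# .where KEEPS every row and puts NaN where the condition is False
replaced = precip.where(precip > 0.0, np.nan)

print(len(times), len(dropped), len(replaced))
```

Only the replaced version still lines up with the timestamps one-to-one, which is what plt.scatter needs.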

Excellent! Let's plot!

In [None]:
# make a scatter plot without 0 mm precip
fig = plt.figure(figsize=(10,8)) 
plt.scatter(xdata,ydata,s=10,color='blue',label='Precipitation (mm)')
plt.xlabel('Date', fontsize=14)                         
plt.ylabel('Accumulated Precipitation (mm)', fontsize=14)
plt.xticks(fontsize=12)                                 
plt.yticks(fontsize=12)
plt.title('Amount of Precipitation in 2019', fontsize=18)
plt.legend()
plt.show()

Awesome! This figure looks good. We can definitely learn some things from this plot, for example that precipitation is higher in the summer (July to September). Does this make sense for Madison, WI?

### Exercises

Now let's do some practice! Practice makes perfect, right? Please try your best first, but if you get stuck, scroll down and find some help.

<b>Exercise 1: Plot at least two variables on the same figure using .plot for one and .scatter for the other. Change the colors, markers, and linestyles and add a legend to the figure.</b>

<b>Exercise 2: We have already looked at the weather data on an hourly basis. Can you compare two variables on a monthly basis?</b>

* Hint: use .groupby() to get monthly data

<b>Exercise 3: What is the diurnal pattern of the summer temperature? When does the highest daily temperature occur in summer?</b>
* Hint: diurnal means daily. So you want to find the average daily pattern of temperature. What does temp look like, on average, at 1 am, 2 am, 3 am, etc.?

### Possible Solutions

#### Exercise 1:

In [None]:
var1 = 'windspeed_ms'
var2 = 'slp_hpa'
df1 = df.dropna(axis='index',subset=[var1,var2])   # you can drop NaNs on more than one column at a time!
df1

In [None]:
xdata = df1['datetime']
ydata1 = df1[var1]
ydata2 = df1[var2]

In [None]:
# just checking . . . 
fig = plt.figure(figsize=(10,8))
plt.plot(xdata,ydata2,color='red')
plt.scatter(xdata,ydata1,color='blue')
plt.show()

Doesn't look great. Let's rescale the pressure data. And remove all the 0.0 m/s wind speed data.

In [None]:
ydata2_mean = ydata2.mean()
ydata2_rescale = ydata2 - ydata2_mean
ydata1_nozeros = df1[var1].where(df1[var1]>0.0,np.nan)

In [None]:
fig = plt.figure(figsize=(10,8))
plt.plot(xdata,ydata2_rescale,color='red')
plt.scatter(xdata,ydata1_nozeros,color='blue',s=10,marker='+')
plt.show()

**Note:** This figure is not presentation-quality. There are no labels and the y-axis corresponds to two different units. However, as an exercise, it's fine to explore plotting different types of data.

#### Exercise 2:

In [None]:
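# get monthly temperature and dew point using .groupby()
# we group the full DataFrame (df) so the averages do not depend on the precipitation quality control;
# .mean() skips NaN values by default
month_data = df.groupby(by=df.month).mean()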
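If .groupby() is new to you, this sketch shows the idea on four invented rows (the name `toy` and its values are made up). Here the grouping column is passed by name, which is one of several equivalent ways to call groupby.

```python
import pandas as pd

# two made-up observations in January and two in February
toy = pd.DataFrame({
    'month': [1, 1, 2, 2],
    'airtemp_degc': [-5.0, -3.0, 0.0, 2.0],
})

# rows with the same month are collected together and then averaged
monthly = toy.groupby('month')['airtemp_degc'].mean()
print(monthly)
```

The result has one value per month, and the month numbers become the index.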

In [None]:
# since we are plotting monthly data, xdata would be the 12 months
xdata = np.arange(12) # range from 0 to 11
ydata1 = month_data['airtemp_degc']
ydata2 = month_data['dewpoint_degc']

In [None]:
# try calendar.month_name[i] command if you do not want to type names of the months
# calendar.month_abbr gives you the abbreviations
months = calendar.month_abbr[1:13]
months

In [None]:
fig = plt.figure(figsize=(10,8)) 
plt.plot(xdata,ydata1,color='red',label='Temperature (˚C)')         # plot the temperature data in red
plt.plot(xdata,ydata2,color='blue',label='Dewpoint (˚C)')           # plot the dewpoint data in blue                        
plt.ylabel('Temperature (˚C)', fontsize=14)
plt.xticks(xdata,months,fontsize=12)                                # set tick labels as month name            
plt.yticks(fontsize=12)
plt.title('Monthly Temperature and Dewpoint in 2019', fontsize=18)
plt.legend()                                                        # put a legend in the figure
plt.show()

#### Exercise 3:

In [None]:
# first step, let's select temperatures for summer.
# we start from the full DataFrame (df) so the result does not depend on the precipitation quality control
df_var1 = df.set_index("datetime")
summer_temp = df_var1.loc['2019-06-01':'2019-08-31']

# then, let's group data by hours to get diurnal pattern
summer_temp2 = summer_temp.groupby(by=summer_temp.index.hour).mean()

# get hourly temperatures averaged over summer 
diurnal_summer_temp = summer_temp2['airtemp_degc']

In [None]:
summer_temp2

In [None]:
# we want our x data to be the hour, so we can just select the index
xdata = summer_temp2.index.values

In [None]:
fig = plt.figure(figsize=(10,8)) 
plt.plot(diurnal_summer_temp,color='red',label='Temperature (˚C)')         # plot the temperature data in red
plt.ylabel('Temperature (˚C)', fontsize=14)
plt.xlabel('Hour',fontsize=14)
plt.xticks(xdata,fontsize=12)  
plt.yticks(fontsize=12)
plt.title('Diurnal pattern of Summer Temperature in 2019', fontsize=18)
plt.show()

Weird, why is the temperature coldest at 11 am? Wait, does this issue feel familiar?
Yes! This happened in tutorial 2, and we know why: the time is in UTC! We can fix that by making the DatetimeIndex time-zone aware and then converting the time zone.

In [None]:
df_localized = df_var1.tz_localize(tz='UTC') # make the DataFrame time-zone aware
df_new = df_localized.tz_convert(tz='US/Central') # convert from UTC to Central time
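Here is the same two-step conversion on a toy DataFrame with invented temperatures, so you can see what happens to the timestamps. Central time is five hours behind UTC in July (daylight saving time).

```python
import pandas as pd

# two made-up observations, recorded in UTC but not yet labeled as UTC
idx = pd.to_datetime(['2019-07-01 18:00', '2019-07-01 19:00'])
toy = pd.DataFrame({'airtemp_degc': [27.0, 27.5]}, index=idx)

# step 1: tell pandas these times are UTC (the clock times do not change)
utc = toy.tz_localize(tz='UTC')

# step 2: convert to Central time (the clock times shift, the data do not)
central = utc.tz_convert(tz='US/Central')

print(central.index.hour.tolist())
```

The temperatures stay attached to the same moments in time; only the clock labels change.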

In [None]:
# redo the averaging calculations for the new time-zone aware data
summer_temp_new = df_new.loc['2019-06-01':'2019-08-31']
summer_temp_new2 = summer_temp_new.groupby(by=summer_temp_new.index.hour).mean()
diurnal_summer_temp2 = summer_temp_new2['airtemp_degc']
xdata = summer_temp_new2.index.values

In [None]:
# find the time when maximum temperature occurs during summer; np.where() is a useful command here
# note: np.where() returns a position; it equals the hour here because the index runs from 0 to 23 in order
time_of_maxtemp = np.where(diurnal_summer_temp2 == diurnal_summer_temp2.max())[0][0]

print("Maximum temperature usually occurs at %i:00 during summer" %time_of_maxtemp)
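One thing to keep in mind: np.where gives the position of the maximum, not its index label. The two agree when the index is 0 to 23 in order. The sketch below uses an invented series whose index does not start at 0 to show the difference, and uses pandas' `idxmax` as one alternative that returns the label directly.

```python
import numpy as np
import pandas as pd

# made-up hourly means where hours 0 and 1 are missing from the index
diurnal = pd.Series([18.0, 21.5, 24.0, 22.0], index=[2, 3, 4, 5])

# position of the maximum, counting from 0
position = np.where(diurnal == diurnal.max())[0][0]

# index label of the maximum, i.e. the hour itself
label = diurnal.idxmax()

print(position, label)
```

For the summer data above both approaches should give the same hour, as long as every hour from 0 to 23 is present.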

In [None]:
# plot the figure and add a marker for the max temperature
fig = plt.figure(figsize=(10,8)) 
plt.plot(diurnal_summer_temp2,color='red',label='Temperature (˚C)')         # plot the temperature data in red
plt.scatter(time_of_maxtemp,diurnal_summer_temp2.max(),color='k',marker="*",s=200,label='Max Temp')
plt.ylabel('Temperature (˚C)', fontsize=14)
plt.xlabel('Hour (US/Central)', fontsize=14)
plt.xticks(xdata,fontsize=12)                                # one tick for each hour of the day
plt.yticks(fontsize=12)
plt.title('Diurnal pattern of Summer Temperature in 2019', fontsize=18)   
plt.legend()
plt.show()