<a href="https://colab.research.google.com/github/cu-gwc-datascience/classes/blob/main/Week6_TimeSeriesData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Week 6: Time Series Data Visualization**

Girls Who Code @ Columbia

Class 3: Intro to Data Science

Week 6 curriculum by Marina Gemma (meg2203@columbia.edu). 


<img src="https://media4.giphy.com/media/LDaBbFGqUwjPq/giphy.gif?cid=ecf05e47wntryahj4oqkzftqc4iew3taxwp130s930ksev1j&rid=giphy.gif&ct=g" width="300" height="400">


## **Label and save your notebooks**

1. Click on "Copy to Drive"

2. Rename your notebook "Week6_YourName.ipynb"

## Learning Objectives:

* Learn about time series data
* Track trends through time
* Calculate daily, monthly, and annual averages
* Find relationships between variables

## What is time series data?

Time series data is data (temperature, for example) that is collected in the same way, repeatedly, over a period of time, so that each measurement is tied to the time it was taken.

Today we will first explore **temperature** measurements that were collected in Central Park over a **nine year period**, from January 1, 2013 to December 31, 2021.  

As always, we will first import the packages we will use today:



In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

#this sets a standard figure size
plt.rcParams['figure.figsize'] = (10,5)

Now, we will import our temperature data. We're calling it cpws because it comes from the **Central Park Weather Station (CPWS)**. However, you can name it something different if you'd like! 

Let's see what's in the dataset:

In [None]:
cpws = pd.read_csv('https://raw.githubusercontent.com/cu-gwc-datascience/classes/main/data/CPWS_2013_2021_edited.csv', index_col=None)
cpws.head(25)

Yikes! There are 104,652 rows of data! 

This is because ***hourly temperature measurements*** are reported in our dataset, along with ***daily and monthly averages*** for our nine year period (2013-2021). 

The first 23 lines report **hourly** temperature measurements for January 1, 2013. Note that the last displayed row (row 24) has a `REPORT_TYPE` of "`SOD`". This is the reported **daily** average for January 1, 2013. So the hourly measurements are followed by a daily average for each day, and each month of measurements is followed by a monthly average. 

Let's look at the first row of our pandas dataframe `cpws` using `.iloc[]`:

In [None]:
cpws.iloc[0]

We see that the value in the `DATE` column is reported as `2013-01-01T00:51:00`. This is a combination of the date of the measurement (January 1, 2013) and the time (00:51, or 12:51am). 

We can tell pandas to read the `DATE` column as a combination of date and time values using `pd.to_datetime()`:

In [None]:
cpws['DATE'] = pd.to_datetime(cpws['DATE'])
cpws['DATE']

Notice that the `dtype` at the bottom is now `dtype: datetime64[ns]`! 

Setting `cpws['DATE']` **equal** to `pd.to_datetime(cpws['DATE'])` tells pandas to replace the original `cpws['DATE']` column with the new, converted datetime column.
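If you'd like to see the conversion on something smaller first, here is a minimal sketch. The dataframe `toy` and its three timestamps are made up for illustration; only the format mirrors our `DATE` column. The `.dt` accessor used at the end is a standard pandas feature that we don't otherwise use in this notebook.

```python
import pandas as pd

# Made-up timestamps written in the same style as the DATE column
toy = pd.DataFrame({"DATE": ["2013-01-01T00:51:00",
                             "2013-01-01T01:51:00",
                             "2013-02-15T13:51:00"]})

# Before conversion, pandas treats the values as plain text
print(toy["DATE"].dtype)

# Replace the text column with the converted datetime column
toy["DATE"] = pd.to_datetime(toy["DATE"])
print(toy["DATE"].dtype)

# Datetime columns expose their parts through the .dt accessor
months = toy["DATE"].dt.month.tolist()
hours = toy["DATE"].dt.hour.tolist()
print(months, hours)
```

Once a column holds datetimes, pandas can pull out the month, day, or hour for us, which is what makes the grouping later in this notebook possible.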

We want to examine the temperature data, so let's find where that is located by printing the dataframe column names:

In [None]:
cpws.columns

There's no column that just says ***temperature***. So what do we use? One way to find out is to explore the [documentation](https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/LCD_documentation.pdf) of the dataset. Here, the various reported values in our dataset are explained in detail.

Turns out `HourlyDryBulbTemperature`, `DailyAverageDryBulbTemperature`, and `MonthlyMeanTemperature` are the values we are interested in -- they report the hourly, daily average, and monthly average temperatures, respectively. The *dry bulb temperature* is "commonly used as the standard air temperature", according to the National Oceanic and Atmospheric Administration (NOAA) [documentation](https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/LCD_documentation.pdf). Due to the formatting of our data file, we have to run the code below to make sure that pandas reads the hourly temperature column, along with the precipitation and snowfall columns we will use later, as numeric values:

In [None]:
cpws['HourlyDryBulbTemperature'] = pd.to_numeric(cpws['HourlyDryBulbTemperature'], errors='coerce')
cpws['MonthlyTotalSnowfall'] = pd.to_numeric(cpws['MonthlyTotalSnowfall'], errors='coerce')
cpws['DailyPrecipitation'] = pd.to_numeric(cpws['DailyPrecipitation'], errors='coerce')
cpws['DailySnowfall'] = pd.to_numeric(cpws['DailySnowfall'], errors='coerce')
cpws['MonthlyTotalLiquidPrecipitation'] = pd.to_numeric(cpws['MonthlyTotalLiquidPrecipitation'], errors='coerce')
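
To see what `errors='coerce'` does, here is a small sketch on made-up values. The series `raw` is hypothetical; the entries `"T"` and `"29s"` simply stand in for any text that is not a clean number.

```python
import pandas as pd

# Made-up readings: two of the entries are not clean numbers
raw = pd.Series(["33", "34.5", "T", "29s", "29"])

# errors='coerce' turns anything that cannot be parsed into NaN
clean = pd.to_numeric(raw, errors="coerce")
print(clean)

# Count how many entries could not be converted
n_missing = int(clean.isna().sum())
print(n_missing)
```

The values that could not be read as numbers become `NaN` (missing) instead of stopping the conversion with an error.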

Now that we know our temperature data is good to go, we can try another quick and useful tool -- the pandas `.describe()` operation. This [generates descriptive statistics](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) for your dataframe. It is a broad but informative look at the distribution of our dataset:

In [None]:
cpws.describe()

To learn more about the values this returns, explore [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html). 
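As a quick illustration of what `.describe()` reports, here is a sketch on a tiny made-up dataframe (the name `toy` and its five values are assumptions, not part of our dataset):

```python
import pandas as pd

# Five made-up temperature readings
toy = pd.DataFrame({"temperature": [30.0, 32.0, 35.0, 41.0, 38.0]})

summary = toy.describe()
print(summary)

# The row labels are the names of the statistics
stat_names = summary.index.tolist()
print(stat_names)
```

Each row of the result is one statistic (count, mean, standard deviation, minimum, the three quartiles, and maximum), and each column is a numeric column of the original dataframe.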

Let's plot the temperature measurements for the full nine year period. We can plot the `DATE` column on the x-axis, and the `HourlyDryBulbTemperature` on the y-axis:

In [None]:
plt.plot(cpws.DATE, cpws.HourlyDryBulbTemperature, marker='o')

plt.xlabel('Date')
plt.ylabel('Temperature ($^{\circ}$F)')
plt.title('Central Park Temperature, 2013-2021')

In the `plt.ylabel` line, we add `$^{\circ}$` to display the **degree** symbol - our temperature units are reported in **degrees Fahrenheit**.

There is a TON of data on this plot! Over 100,000 points. You can see the broad temperature cycle over the years - the temperatures get higher during the summer months, and lower during the winter months. Before we dive into long term trends, let's break this down and explore small scale trends first.

Last week we used **filters** to create subsets of our dataframe. We will use filters again this week, but in a slightly different format. 

Let's say we want to examine temperatures in the month of January. First, we call our dataframe as we normally do:

    cpws[]

In order to **only** show data from the month of January, we need to tell the dataframe to display data where the `MONTH` column contains `January`: 

    (cpws['MONTH'] == 'January')

We then combine our dataframe call `cpws[]` with the January data like so:

    cpws[(cpws['MONTH'] == 'January')]

See for yourself below:

In [None]:
january_temp = cpws[(cpws['MONTH'] == 'January')]
january_temp

This returns a dataframe subset that only contains rows where the `MONTH` column contains **January**.
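To see why this works, it can help to look at the condition on its own. Here is a sketch using a made-up four-row dataframe named `toy` (the values are hypothetical; only the column names match our dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "MONTH": ["January", "January", "February", "March"],
    "HourlyDryBulbTemperature": [31.0, 28.0, 35.0, 44.0],
})

# The condition alone gives one True/False value per row
is_january = toy["MONTH"] == "January"
print(is_january.tolist())

# Passing that True/False list to the dataframe keeps only the True rows
january_rows = toy[is_january]
print(january_rows)
```

The filter is just a column of `True` and `False` values, and `toy[...]` keeps the rows marked `True`.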

We can plot this January temperature data on the y-axis, and the `DATE` column on the x-axis:


In [None]:
plt.plot(january_temp.DATE, january_temp.HourlyDryBulbTemperature)

plt.xlabel('Date')
plt.ylabel('Temperature ($^{\circ}$F)')
plt.title('Central Park Temperature, January 2013-2021')

BUT, making and naming a filter each time we want to plot something can be tedious. With pandas, we can filter our dataset **directly** in the plotting command, like so:

In [None]:
plt.plot(cpws[(cpws['MONTH'] == 'January')].DATE, cpws[(cpws['MONTH'] == 'January')].HourlyDryBulbTemperature)

plt.xlabel('Date')
plt.ylabel('Temperature ($^{\circ}$F)')
plt.title('Central Park Temperature, January 2013-2021')

We just replaced `january_temp` with the filter criteria `cpws[(cpws['MONTH'] == 'January')]`, and made the exact same plot. 

Voila! We can filter and plot in the same line of code. 


## Using the DatetimeIndex


An extremely powerful feature of pandas is the [DatetimeIndex](https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html). Instead of using a numerical index like we have been (i.e. 0,1,2,3...), we can tell pandas to use our `DATE` column as the index of our `cpws` dataframe - allowing us to more easily search and plot our data based on the date it was collected. We do this by adding `.set_index('DATE')` to our `cpws` dataframe:

In [None]:
cpws = cpws.set_index('DATE')

Yay! Now we are using our date values as our index. You'll notice that the `DATE` column has moved from its original column position, and is now in the index column position:

In [None]:
cpws

Using the date as our index has simplified some things. 

We no longer need to plot the `DATE` column as our x-values. We can just add `.plot()` to the end of the dataframe column we are interested in (`HourlyDryBulbTemperature`) to create the same plots we made above:

In [None]:
cpws.HourlyDryBulbTemperature.plot()

plt.ylabel('Temperature ($^{\circ}$F)')
plt.title('Central Park Temperature, 2013-2021')

We can also add the filter directly to our plotting command as we did before:

In [None]:
cpws[(cpws['MONTH'] == 'January')].HourlyDryBulbTemperature.plot()

plt.ylabel('Temperature ($^{\circ}$F)')
plt.title('Central Park Temperature, January 2013-2021')

Do you agree that using:

    cpws[(cpws['MONTH'] == 'January')].HourlyDryBulbTemperature.plot()

instead of:

    plt.plot(cpws[(cpws['MONTH'] == 'January')].DATE, cpws[(cpws['MONTH'] == 'January')].HourlyDryBulbTemperature)

is easier? Thanks pandas!

We also don't have to include `plt.xlabel()` since pandas automatically labels the x-axis with the name of our index!

In the plot directly above, we are able to see the variability in January temperatures across the 9 year time period, but it's hard to see the variability across the month of January *itself*. To better see this, we can plot the temperature against the days in January.

We will want to isolate the temperatures for the month of January ***for each year*** -- we can do this by adding a second condition to our filter, which will filter by year, using the `&` sign:

In [None]:
cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == 2013)].HourlyDryBulbTemperature.plot()

plt.ylabel('Temperature ($^{\circ}$F)')
plt.title('Central Park Temperature, January 2013')

Be sure to keep `()` around each condition - the filter won't work without them!
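Here is a small sketch of why the parentheses matter, using a made-up dataframe `toy` (all values are hypothetical). In Python, `&` is evaluated before `==`, so without parentheses the pieces get combined in the wrong order and pandas raises an error.

```python
import pandas as pd

toy = pd.DataFrame({
    "MONTH": ["January", "January", "February", "January"],
    "YEAR": [2013, 2014, 2013, 2013],
    "HourlyDryBulbTemperature": [31.0, 20.0, 35.0, 29.0],
})

# With parentheses: each condition is evaluated first, then combined
both = toy[(toy["MONTH"] == "January") & (toy["YEAR"] == 2013)]
print(both)

# Without parentheses: Python tries "January" & toy["YEAR"] first
try:
    toy[toy["MONTH"] == "January" & toy["YEAR"] == 2013]
    failed = False
except Exception as err:
    failed = True
    print("Error:", type(err).__name__)
```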

We can do the same for the January 2014 data, by switching the year we put into the filter:

In [None]:
cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == 2014)].HourlyDryBulbTemperature.plot()

plt.ylabel('Temperature ($^{\circ}$F)')
plt.title('Central Park Temperature, January 2014')

But when we try to plot the January 2013 and January 2014 temperature profiles together:

In [None]:
cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == 2013)].HourlyDryBulbTemperature.plot()
cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == 2014)].HourlyDryBulbTemperature.plot()

plt.ylabel('Temperature ($^{\circ}$F)')
plt.title('Central Park Temperature, January 2013 and January 2014')

they won't plot on top of each other, since they are from different years.

To compare the January 2013 and 2014 data on the same plot, we will need to parse the data a bit differently.

## Using `.groupby()` in pandas

We will do this by using the pandas [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby) `.groupby()` operation, which will group our data according to an index property.

The notation is relatively straightforward:

    cpws.groupby(cpws.index.dayofyear)

where `dayofyear` can be replaced by the other attributes (month, year, etc.) listed [here](https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html).
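To get a feel for these index attributes, here is a sketch on a made-up dataframe with four hypothetical timestamps as its index:

```python
import pandas as pd

# Made-up timestamps used as a DatetimeIndex
dates = pd.to_datetime(["2013-01-01 00:51", "2013-01-31 12:51",
                        "2013-02-01 08:51", "2014-12-31 23:51"])
toy = pd.DataFrame({"HourlyDryBulbTemperature": [31.0, 40.0, 25.0, 45.0]},
                   index=dates)

# Each attribute returns one number per row
days_of_year = toy.index.dayofyear.tolist()
months = toy.index.month.tolist()
years_in_index = toy.index.year.tolist()

print(days_of_year)
print(months)
print(years_in_index)
```

Notice that `dayofyear` counts from the start of each year, so January 1 is day 1 in every year. That is what lets us line up different years on the same x-axis.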

When we want to use **filtered data**, we can put the filter directly in the `cpws` call and in the `.groupby()` command -- we just use `cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == 2013)]` instead of only `cpws` in both places:

    cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == 2013)].groupby(cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == 2013)].index.dayofyear)


Then we will add `.mean(numeric_only=True).HourlyDryBulbTemperature.plot()` to the end of the above code to create our plot. The `.mean()` operation gives us the **average** of the hourly temperatures per day (equivalent to the **daily average**) since we grouped by `dayofyear`. Adding `numeric_only=True` tells pandas to skip text columns such as `MONTH`, which cannot be averaged; pandas 2.0 and later raise an error without it. Let's see how this looks below:

In [None]:
cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == 2013)].groupby(cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == 2013)].index.dayofyear).mean(numeric_only=True).HourlyDryBulbTemperature.plot(marker='o')

plt.ylabel('Temperature ($^{\circ}$F)')
plt.title('Central Park Temperature, January 2013')

Here are the average daily temperatures for January 2013! We added `marker='o'` to the `.plot()` operation so that we could see where the point for each day falls on the plot. 
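To check what the grouping and averaging are doing, here is a sketch on made-up hourly readings (three on January 1 and two on January 2; the dataframe `toy` and its values are hypothetical). `numeric_only=True` tells `.mean()` to skip text columns such as `MONTH`.

```python
import pandas as pd

# Made-up hourly readings: three on January 1 and two on January 2
dates = pd.to_datetime([
    "2013-01-01 00:51", "2013-01-01 06:51", "2013-01-01 12:51",
    "2013-01-02 00:51", "2013-01-02 12:51",
])
toy = pd.DataFrame(
    {"HourlyDryBulbTemperature": [30.0, 33.0, 36.0, 20.0, 24.0],
     "MONTH": ["January"] * 5},
    index=dates,
)

# Group the rows by day of the year, then average each group
daily_mean = toy.groupby(toy.index.dayofyear).mean(numeric_only=True).HourlyDryBulbTemperature
print(daily_mean)
```

The result has one row per day: the three readings from day 1 are averaged together, and the two readings from day 2 are averaged together.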

Let's add the average daily temperatures for January 2014 to the plot! We'll also add label names to the plot command, and add a legend so we can distinguish the two years from each other:

In [None]:
cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == 2013)].groupby(cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == 2013)].index.dayofyear).mean(numeric_only=True).HourlyDryBulbTemperature.plot(marker='o', label='January 2013')
cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == 2014)].groupby(cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == 2014)].index.dayofyear).mean(numeric_only=True).HourlyDryBulbTemperature.plot(marker='o', label='January 2014')

plt.ylabel('Temperature ($^{\circ}$F)')
plt.title('Central Park Temperature, January 2013 and January 2014')
plt.legend()

Which year had the **coldest** daily average temperature in January? On approximately what day did this temperature occur?

In [None]:
# your answer here

Now it's your turn -- compare the daily January temperature averages for 2020 and 2021. You will create almost the exact same plot as above, just changing the years to 2020 and 2021.

In [None]:
# your plot here

Which of these years had the **warmest** daily average temperature in January? On approximately what day did this occur?

In [None]:
# your answer here

What if we wanted to plot the January daily average temperatures for all 9 years? We *could* use one plotting command for each year, like we just did for the plots containing two years. OR we could use a `for` loop, and use one line of plotting code instead!

The structure of our plotting command will be the same, but we will iterate over a list of the years we want to plot. We'll define the `years` list below:

In [None]:
years = [2013,2014,2015,2016,2017,2018,2019,2020,2021]

And then create a `for` loop, replacing the individual year in `cpws['YEAR'] == 2013` with `cpws['YEAR'] == years[i]`:

In [None]:
for i in range(len(years)):
  cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == years[i])].groupby(cpws[(cpws['MONTH'] == 'January') & (cpws['YEAR'] == years[i])].index.dayofyear).mean(numeric_only=True).HourlyDryBulbTemperature.plot(marker='o', label='January '+str(years[i]))

plt.legend(bbox_to_anchor=(1.01, 1))
plt.ylabel('Temperature ($^{\circ}$F)')
plt.ylim([0,70])

And here we have the daily average January temperatures for all 9 years! We added `+str(years[i])` to the label name (`label='January '+str(years[i])`) in order to create individual entries for each of the 9 years in our legend, and added `bbox_to_anchor=(1.01, 1)` to our `plt.legend()` command to place the legend **outside** of the plot, since there isn't a lot of room left inside the plot. You can explore more ways to customize your plots [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html).
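If the `years[i]` notation is new to you, here is a sketch of how the loop builds each legend label, with no plotting involved. The short list `some_years` is made up for illustration. The second loop is an alternative style that walks through the list directly instead of using an index `i`; both give the same result.

```python
some_years = [2013, 2014, 2015]

# Index-based loop, the style used in this notebook
labels = []
for i in range(len(some_years)):
    labels.append('January ' + str(some_years[i]))

# Direct loop over the list: same labels, no index needed
labels_direct = []
for year in some_years:
    labels_direct.append('January ' + str(year))

print(labels)
print(labels_direct)
```

We need `str()` because a number can't be joined to text with `+` until it has been converted to a string.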

You'll notice this plot is hard to read though - the lines plot on top of each other, and it is hard to distinguish one year from another.

This is *not* a great example of data visualization. There is a lot of variability and a lot of overlap, making the data difficult to interpret. We can say that January temperatures for 2013-2021 fall between ~10 and 60 degrees Fahrenheit, but it's hard to pick out more detailed observations.

Let's try visualizing **monthly averages** of the data instead, using `MonthlyMeanTemperature` in place of `HourlyDryBulbTemperature`. We'll start with 2013, as we did above. This time, we only need to filter for the year, 2013, and not the month, since we are interested in all the months. In the `.groupby()` operation, we change the index grouping from `dayofyear` to `month`: 

In [None]:
cpws[(cpws['YEAR'] == 2013)].groupby(cpws[(cpws['YEAR'] == 2013)].index.month).mean(numeric_only=True).MonthlyMeanTemperature.plot(marker='o', label='2013')

plt.xlabel('Month')
plt.ylabel('Temperature ($^{\circ}$F)')
plt.title('Central Park Monthly Average Temperature, 2013')

We added the x-label `plt.xlabel('Month')` back in to clarify that the 'DATE' being plotted was the month, not the number of days.

We can then add 2014 to the plot, along with a legend:

In [None]:
cpws[(cpws['YEAR'] == 2013)].groupby(cpws[(cpws['YEAR'] == 2013)].index.month).mean(numeric_only=True).MonthlyMeanTemperature.plot(marker='o', label='2013')
cpws[(cpws['YEAR'] == 2014)].groupby(cpws[(cpws['YEAR'] == 2014)].index.month).mean(numeric_only=True).MonthlyMeanTemperature.plot(marker='o', label='2014')


plt.legend()
plt.xlabel('Month')
plt.ylabel('Temperature ($^{\circ}$F)')
plt.title('Central Park Monthly Average Temperature, 2013 and 2014')

Which year (2013 or 2014) was, on average, warmer? Was it a lot warmer, or just a little bit warmer?

In [None]:
# your answer here

Now, use a `for` loop to plot the monthly average temperatures for **each year** of our dataset. You can use the same for loop structure that we used above, but don't forget to edit the plotting command to plot the monthly averages instead of the daily averages. 

The list `years` is already defined, so you don't have to make a list this time. Don't forget to label your axes and add a title and a legend!

In [None]:
# your plot here

What observations, if any, can you make about the monthly average temperatures in this plot?

In [None]:
# your answer here

Now let's explore the **yearly (or annual)** averages. 

First, we need to change the index grouping in the `.groupby()` operation from `month` to `year`.

Since there is no column in our dataset that tells you the annual temperature averages, we will still plot `MonthlyMeanTemperature`. However, through grouping by year and taking the mean, we are actually calculating the annual average temperature! So problem solved:

In [None]:
cpws[(cpws['YEAR'] == 2013)].groupby(cpws[(cpws['YEAR'] == 2013)].index.year).mean(numeric_only=True).MonthlyMeanTemperature.plot(marker='o', label='2013')
cpws[(cpws['YEAR'] == 2014)].groupby(cpws[(cpws['YEAR'] == 2014)].index.year).mean(numeric_only=True).MonthlyMeanTemperature.plot(marker='o', label='2014')


plt.legend()
plt.xlabel('Year')
plt.ylabel('Temperature ($^{\circ}$F)')
plt.title('Central Park Yearly Average Temperature, 2013 and 2014')

Two points on a plot don't tell us much. To see broader trends, we need to plot **more** annual averages.

Create a for loop below that plots the yearly averages for our dataset, using the list `years`. Again, you can use the same for loop structure for plotting as we did before, but update the plotting command with the appropriate index grouping and label. Don't forget your axis labels, legend, and title. 

In [None]:
# your plot here

1. What year had the highest average temperature? 
2. What year had the lowest average temperature? 
3. Did any years have approximately the same average temperature?


In [None]:
# 1. your answer here

# 2. your answer here

# 3. your answer here

Now try changing the limits of the y-axis using `plt.ylim()`. First, try `plt.ylim([50,60])`:

In [None]:
# recreate your plot and add a ylimit

Now try with `plt.ylim([40,70])`. 

Does changing the displayed range of values on the y-axis change your answers to questions 1-3 above? Why or why not?

In [None]:
# your reasoning here

We'll note here that there is another way to create this plot *without* a for loop. Instead of using a for loop, we can **remove the filter** we used on `cpws` and plot the **annual averages** like this:

In [None]:
cpws.groupby(cpws.index.year).mean(numeric_only=True).MonthlyMeanTemperature.plot(marker='o')

plt.ylabel('Temperature ($^{\circ}$F)')

Then why use a for loop when you can make this plot with one line of code?

The main difference between this plot and your previous plot is that the points in this plot are all the **same color**. In the for loop, each year was plotted **individually** with its own plotting command, so each year was drawn as a separate series with its own color. However, if we don't filter our dataframe and instead plot all the years collectively, the plotting command treats them as one series of data, and gives them all the same color.

Going forward, which method you use is up to you -- both plots are telling you the same information, but color coding is a way to help distinguish your data if you wish.

P.S. To remove the line between the points, add `linestyle=''` to the plotting command.

Remember, we are plotting **averages** of these values over a period of time (i.e. day or month or year). Below we will plot **totals** for a period of time. 

## When it rains, it pours!

Now let's explore something different. Our `cpws` dataset also contains measurements for **rainfall** and **snowfall**. The daily measurements are recorded as `DailyPrecipitation` and `DailySnowfall`. 


We can plot the `DailyPrecipitation` for 2020 like so:

In [None]:
cpws[(cpws['YEAR'] == 2020)].DailyPrecipitation.plot(marker='o')

But remember that if we want to plot data from two different years on top of each other, we need to use `.groupby()`:

In [None]:
cpws[(cpws['YEAR'] == 2020)].groupby(cpws[(cpws['YEAR'] == 2020)].index.dayofyear).mean(numeric_only=True).DailyPrecipitation.plot(marker='o', label='2020')
cpws[(cpws['YEAR'] == 2021)].groupby(cpws[(cpws['YEAR'] == 2021)].index.dayofyear).mean(numeric_only=True).DailyPrecipitation.plot(marker='o', label='2021')


plt.legend()
plt.xlabel('Day')
plt.ylabel('Precipitation (inches)')
plt.title('Central Park Precipitation, 2020 and 2021')

Which year (2020 or 2021) had the day with the **highest** precipitation?

In [None]:
# your answer here

You'll notice that many days on the above plot have a value of 0. This is because it doesn't rain every day! With temperature, you will **always** have a measurement -- your record is continuous. However, with precipitation, you might have gaps in your data if there are days with no rain.

Now, let's look at precipitation **by month** for 2020 and 2021. The column `MonthlyTotalLiquidPrecipitation` reports the **TOTAL** precipitation (in inches) that was recorded for that month:

In [None]:
cpws[(cpws['YEAR'] == 2020)].groupby(cpws[(cpws['YEAR'] == 2020)].index.month).mean(numeric_only=True).MonthlyTotalLiquidPrecipitation.plot(marker='o', label='2020')
cpws[(cpws['YEAR'] == 2021)].groupby(cpws[(cpws['YEAR'] == 2021)].index.month).mean(numeric_only=True).MonthlyTotalLiquidPrecipitation.plot(marker='o', label='2021')


plt.legend()
plt.xlabel('Month')
plt.ylabel('Precipitation (inches)')
plt.title('Central Park Monthly Total Liquid Precipitation, 2020 and 2021')

Note the lack of a data point for March 2021. Either it didn't rain at all that month, or there is missing data! How might you go about finding out which it is?

In [None]:
# your answer here


Pick two other years, and plot the monthly precipitation totals:

In [None]:
# your plot here

Which month had the most precipitation? Was it the same for each year?

In [None]:
# your answer here

Now, let's plot the **total** yearly precipitation for all of the years in our dataset. We will use a for loop, but the syntax will be a bit different: 

* First, we will need to calculate the **SUM** of each year's `MonthlyTotalLiquidPrecipitation` values to get the total precipitation for that year. We will want to use a filter that filters the data by year `cpws[(cpws['YEAR'] == 2013)]`, and luckily, we can just add `.sum()` to the end of the filter and column name to calculate the total precipitation per year:

      cpws[(cpws['YEAR'] == 2013)].MonthlyTotalLiquidPrecipitation.sum()

* Instead of calling our pandas column and using `.plot()` at the end, we will use the `plt.plot()` command this time, and plot the list `years` as our x values.
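
The `.sum()` step from the first bullet can be checked on a tiny made-up table before running it on the real data. In this sketch, `toy` and `years_toy` are hypothetical; the missing value (`NaN`) stands in for rows that have no monthly total, which is presumably the case for the hourly and daily rows in our dataset.

```python
import pandas as pd

# Made-up monthly totals; NaN marks a row with no monthly value
toy = pd.DataFrame({
    "YEAR": [2013, 2013, 2013, 2014, 2014],
    "MonthlyTotalLiquidPrecipitation": [2.5, float("nan"), 3.0, 1.0, 4.0],
})

years_toy = [2013, 2014]
totals = []
for i in range(len(years_toy)):
    # Filter by year, pick the column, and add up its values
    totals.append(toy[(toy["YEAR"] == years_toy[i])].MonthlyTotalLiquidPrecipitation.sum())

print(totals)
```

By default, `.sum()` skips missing values, so only the rows that actually contain a monthly total contribute to each year's sum.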

Combining all of this, we produce:

In [None]:
for i in range(len(years)):
    plt.plot(years[i], cpws[(cpws['YEAR'] == years[i])].MonthlyTotalLiquidPrecipitation.sum(), marker='o', label=str(years[i]))

plt.legend()
plt.ylabel('Total Precipitation (inches)')



1.   Which year had the most rain?
2.   Which year had the least rain?



In [None]:
# 1. your answer here

# 2. your answer here

Your turn! Plot the total annual **snowfall** for each year in our dataset! Like we did above, do this using:

* a for loop that iterates over the list `years`
* the list `years` as your x-values
* `MonthlyTotalSnowfall` instead of `MonthlyTotalLiquidPrecipitation` to calculate the total snowfall with `.sum()`
* `plt.plot()` instead of `.plot()`
* `label=str(years[i])` to label your points

Don't forget to plot a title, legend, and axis labels!


In [None]:
# your plot here


1.   Which year had the most snow?
2.   Which year had the least snow?
3.   What is the *range* of total snowfall for the 9 years you plotted?

In [None]:
# 1. your answer here

# 2. your answer here

# 3. your answer here

Is there any kind of relationship between the yearly precipitation totals and the yearly snowfall totals? Is one higher when the other is lower, or vice versa? If so, why might that be?

In [None]:
# your answer here

## Plotting multiple measurements

Now let's try plotting **different measurements** from the **same time**. We will plot the average of wind speed (`HourlyWindSpeed`) and the monthly total of liquid precipitation (`MonthlyTotalLiquidPrecipitation`) for each month in the year 2020: 

In [None]:
cpws[(cpws['YEAR'] == 2020)].groupby(cpws[(cpws['YEAR'] == 2020)].index.month).mean(numeric_only=True).HourlyWindSpeed.plot(marker='o', label='Wind Speed', c='g')
cpws[(cpws['YEAR'] == 2020)].groupby(cpws[(cpws['YEAR'] == 2020)].index.month).mean(numeric_only=True).MonthlyTotalLiquidPrecipitation.plot(marker='o', label='Precipitation', c='b')

plt.legend()
plt.xlabel('Month')
plt.ylabel('Wind Speed (mph) \n Precipitation (inches)')

Notice that we have **two labels** on the y-axis. This is because we are plotting two different measurements -- one is **wind speed**, measured in miles per hour (mph), and the other is **precipitation**, measured in inches. In the `plt.ylabel()`, adding `\n` will create a new line in the label -- we included this so that `Wind Speed (mph)` and `Precipitation (inches)` aren't on the same line, making the y-label a bit easier to read. 

**Important Note**: We plotted the monthly **average** wind speed and the monthly **total** precipitation above. They are not *both* averages.

Now you plot the monthly average of wind speed (`HourlyWindSpeed`) and the monthly total of liquid precipitation (`MonthlyTotalLiquidPrecipitation`) for each month in the year 2013:


In [None]:
# your plot here

Do you notice a relationship between the monthly average wind speed and the monthly total precipitation? If yes, why do you think that is? If no, would you expect one? Why or why not?

In [None]:
# your answer here

Now plot the monthly average wind speed for each **year**:



*   Use a for loop that iterates over the list `years`
*   Filter the data by year: `cpws[(cpws['YEAR'] == years[i])]`
*   Use `.groupby()` to group the data by `month`
* After using `.groupby()`, take the mean with `.mean(numeric_only=True)` before calling your column `HourlyWindSpeed` in order to calculate the monthly average wind speed
* Use `label=str(years[i])` in the plotting command, and don't forget to include a legend, title, and axis labels!



In [None]:
# your plot here

1. Is there a trend that you see in wind speed over the course of each year?
2. What months seem to always be the most windy?
3. What months seem to be the least windy?

In [None]:
# 1. your answer here

# 2. your answer here

# 3. your answer here

Just for fun, let's plot the monthly temperature averages along with the monthly wind speed averages. These measurements have different ranges of y-values:

In [None]:
cpws[(cpws['YEAR'] == 2013)].groupby(cpws[(cpws['YEAR'] == 2013)].index.month).mean(numeric_only=True).HourlyWindSpeed.plot(marker='o', label='Wind Speed', c='g')
cpws[(cpws['YEAR'] == 2013)].groupby(cpws[(cpws['YEAR'] == 2013)].index.month).mean(numeric_only=True).HourlyDryBulbTemperature.plot(marker='o', label='Temperature', c='r')
plt.legend()

You barely see any variation in the wind speed values in green because they span a much smaller range than the temperature values in red. 

We can fix this **visualization** problem by using a **secondary axis**. We can add a second y-axis on the right hand side of the plot that has a different range of y-values so that we can see variation in both temperature and wind speed more clearly. 

To add a secondary y axis, we'll need to learn some new syntax. We will need to name both data series that we are plotting, and distinguish temperature as the values that will be plotted on the secondary y-axis:

    axis1 = cpws[(cpws['YEAR'] == 2013)].groupby(cpws[(cpws['YEAR'] == 2013)].index.month).mean(numeric_only=True).HourlyWindSpeed.plot(marker='o', label='Wind Speed', c='g')
    axis2 = cpws[(cpws['YEAR'] == 2013)].groupby(cpws[(cpws['YEAR'] == 2013)].index.month).mean(numeric_only=True).HourlyDryBulbTemperature.plot(secondary_y=True, ax=axis1, marker='o', label='Temperature', c='r')

We named the wind speed plotting command "`axis1`" and the temperature plotting command "`axis2`". We also added `secondary_y=True` and `ax=axis1` to the temperature plotting command in order to assign it to the secondary y-axis. 

We can then label each y axis using individual commands, and make the text color match the data color if we wish:

    axis1.set_ylabel('Wind Speed (mph)', c='g')
    axis2.set_ylabel('Temperature ($^{\circ}$F)', c='r')

Combining all of these commands, we can create our plot:

In [None]:
axis1 = cpws[(cpws['YEAR'] == 2013)].groupby(cpws[(cpws['YEAR'] == 2013)].index.month).mean(numeric_only=True).HourlyWindSpeed.plot(marker='o', label='Wind Speed', c='g')
axis2 = cpws[(cpws['YEAR'] == 2013)].groupby(cpws[(cpws['YEAR'] == 2013)].index.month).mean(numeric_only=True).HourlyDryBulbTemperature.plot(secondary_y=True, ax=axis1, marker='o', label='Temperature', c='r')

axis1.set_ylabel('Wind Speed (mph)', c='g')
axis2.set_ylabel('Temperature ($^{\circ}$F)', c='r')

1. What relationship between **wind speed** and **temperature** do you observe?
2. Would you have been able to see this relationship without the secondary y-axis? Why or why not?

In [None]:
# 1. your answer here

# 2. your answer here

## Finding values in your dataset

Suppose you want to find the day with the highest total precipitation in our 9 years of data. Instead of plotting every day for every year and finding the largest value by looking at a plot, you can use `.nlargest()` to return the largest values in a given column:

    cpws['column_name'].nlargest()

The `.nlargest()` operation will automatically return the **5** largest values in that column. However, you can specify how many values you want it to return using `n=` inside the parentheses:

    cpws['column_name'].nlargest(n=2)

Where you can replace `2` with any number you'd like. You can use `.nsmallest()` in exactly the same way to find the smallest values in your column.

If you **only** want the largest (maximum) value, you can use `.max()` instead:

    cpws['column_name'].max()

This just returns the maximum value. You can use `.min()` in exactly the same way to return the minimum value. 

To find the date that corresponds to your value, you can use:

    cpws[cpws['column_name'] == value]
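
Here is a sketch that puts these three steps together on a made-up dataframe. The name `toy`, the four dates, and the precipitation values are all hypothetical.

```python
import pandas as pd

# Four made-up daily precipitation values with dates as the index
dates = pd.to_datetime(["2020-07-01", "2020-07-02", "2020-07-03", "2020-07-04"])
toy = pd.DataFrame({"DailyPrecipitation": [0.0, 1.2, 0.3, 2.4]}, index=dates)
toy.index.name = "DATE"

# The two largest values, largest first
top_two = toy["DailyPrecipitation"].nlargest(n=2)
print(top_two)

# Only the single largest value
highest = toy["DailyPrecipitation"].max()
print(highest)

# The row (and therefore the date) where that value occurs
match = toy[toy["DailyPrecipitation"] == highest]
print(match)
```

Because the dates are the index, both `.nlargest()` and the filter on the last line show the date next to each value.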

Try finding the values listed below: 

In [None]:
# find the largest hourly temperature value (using HourlyDryBulbTemperature)


In [None]:
#find the date that corresponds to the largest hourly temperature value


In [None]:
# find the *smallest* hourly temperature value


In [None]:
# find the date that corresponds to the *smallest* hourly temperature value


In [None]:
#find the day that had the greatest total snowfall (using DailySnowfall)


In [None]:
#find the date that corresponds to the greatest total daily snowfall value


In [None]:
# find the largest hourly wind gust (HourlyWindGustSpeed)


In [None]:
# find the date that corresponds to the largest hourly wind gust


## Plotting the daily average temperature for your birthday!

Our last exercise today will be plotting the **daily average temperature** for your birthday over the past nine years. We will use a `for` loop to create this plot!

We will iterate over the list `years` (which lists all 9 years in the dataset) in our for loop.

First, we will create a filter that selects the year, month, and day of your birthday. Your filter will have **3** conditions: 

      (cpws['YEAR'] == years[i]) 
      (cpws['MONTH'] == 'your bday month') #remember strings need to be in quotations!
      (cpws['DAY'] == your bday)

and when combined, will look like:

    cpws[(cpws['YEAR'] == years[i]) & (cpws['MONTH'] == 'your bday month') & (cpws['DAY'] == your bday)]


You do **not** need to use `.groupby()`. You'll want to plot the `DailyAverageDryBulbTemperature` for each year. Don't forget your axis labels and plot title!

Extra credit: try assigning a colormap to your plot!

In [None]:
# your birthday plot here



1.   What year had the highest temperature on your birthday?
2.   What year had the lowest temperature on your birthday?
3.   Are there any years that had nearly the same temperature on your birthday?


Try again with another date if you'd like!

In [None]:
# plot average temperature for another day of the year!

# That's all for today!