# Time-based Data

In this lesson, we'll work with some of the ways that Python/pandas can manipulate data based on a time index. But, like everything we've done so far, the data doesn't always start out the way we want it.

In [None]:
import pandas as pd
import numpy as np

The next two links are the data, and a README file that describes the data format. To keep it somewhat close-to-home, the data contained in the first link is from Durham, NC.

https://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2018/CRNS0101-05-2018-NC_Durham_11_W.txt

Clicking on that link, you'll see that there are a bunch of columns but no headers. The values are laid out in fixed-width columns padded with spaces, rather than separated by commas or another single character. Let's see what the pandas default does with this kind of text:

In [None]:
pd.read_csv(r'./CRNS0101-05-2018-NC_Durham_11_W.txt').head()

That's not all that useful. Let's see what the README has to say about it.

https://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/README.txt

If you scroll down to section 5, you'll see a bit where it describes the columns, but it's in what is essentially a text file. Can we get that into something we'd like? Of course! With Regular Expressions!

![Weather Headers](weather_headers.png "Weather Headers!")

Looking at this list, it's pretty easy to see the information we need, and we should be able to formulate a regular expression to extract the middle column. Let's select the info in the browser, then copy and paste it into our text editor. We'll work in there, and save the resulting column names as `weather_headers.csv`.
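As a rough sketch of that regular expression, here is one way it could look in Python instead of a text editor. The excerpt below is a hypothetical stand-in for the README's field list (the real spacing and units may differ), and `readme_excerpt`, `pattern`, and `header_names` are names made up for this illustration.

```python
import re

# Hypothetical excerpt in the style of the README field list:
# field number, field name, units. The real layout may differ.
readme_excerpt = """\
   1    WBANNO                         XXXXX
   2    UTC_DATE                       YYYYMMDD
   3    UTC_TIME                       HHmm
   4    LST_DATE                       YYYYMMDD
   5    LST_TIME                       HHmm
"""

# Skip the leading field number, then capture the name (the middle column).
pattern = re.compile(r"^[ \t]*\d+[ \t]+(\S+)[ \t]+\S+", flags=re.MULTILINE)
header_names = pattern.findall(readme_excerpt)
print(header_names)
```

Writing `header_names` out one per line would give a file shaped like the `weather_headers.csv` used in the next cell.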

In [None]:
headers = pd.read_csv(r'./weather_headers.csv',header=None).squeeze('columns')

We'll import the file we just created, then call `.squeeze('columns')` on the result. This turns the imported single column into a pandas Series, rather than a pandas DataFrame. (Older pandas versions accepted a `squeeze=True` keyword argument to `read_csv` for the same purpose; it was removed in pandas 2.0.) This affects the formatting later.

In [None]:
headers
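To make the DataFrame-versus-Series difference concrete, here is a small sketch using an in-memory stand-in for `weather_headers.csv`. The three names in `csv_text` are just sample values, and `as_frame` / `as_series` are illustrative variable names.

```python
import io
import pandas as pd

# In-memory stand-in for weather_headers.csv: one name per line, no header row.
csv_text = "LST_DATE\nLST_TIME\nAIR_TEMPERATURE\n"

as_frame = pd.read_csv(io.StringIO(csv_text), header=None)
as_series = as_frame.squeeze("columns")  # collapse the single column

print(type(as_frame).__name__, as_frame.shape)    # two-dimensional
print(type(as_series).__name__, as_series.shape)  # one-dimensional
print(list(as_series.values))
```

The one-dimensional `.values` of the Series is what makes it convenient to pass as column names later.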

In [None]:
# pd.read_csv?

Now, we'll import the data from the text file doing the following:
* split on another regular expression, specifically `\s+` meaning one or more whitespace characters.
* treat the first line as data.
* use the 'headers' from above as the headers of the columns
* combine the local date and time into a single datetime field

In [None]:
noaa_data = pd.read_csv(r'./CRNS0101-05-2018-NC_Durham_11_W.txt',delimiter=r'\s+',header=None,names=headers.values,parse_dates=[['LST_DATE','LST_TIME']])
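Newer pandas releases deprecate combining columns through `parse_dates`, so the steps listed above can also be sketched with an explicit `pd.to_datetime` call. The rows below are made up to resemble the NOAA layout, only four of the real columns are included, and the `%Y%m%d %H%M` format is an assumption based on the `LST_DATE` / `LST_TIME` names.

```python
import io
import pandas as pd

# Made-up rows: whitespace-separated, no header line.
raw = """\
20180101 0005  -3.2 0.0
20180101 0010  -3.3 0.0
20180101 0015  -3.4 0.0
"""
names = ["LST_DATE", "LST_TIME", "AIR_TEMPERATURE", "PRECIPITATION"]

sample = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",        # one or more whitespace characters
    header=None,       # treat the first line as data
    names=names,       # supply the column names ourselves
    dtype={"LST_DATE": str, "LST_TIME": str},  # keep leading zeros in the time
)

# Combine the local date and time into a single datetime column.
sample["LST_DATE_LST_TIME"] = pd.to_datetime(
    sample["LST_DATE"] + " " + sample["LST_TIME"], format="%Y%m%d %H%M"
)
sample = sample.drop(columns=["LST_DATE", "LST_TIME"])
print(sample)
```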

Previewing the data, this looks useful.

In [None]:
noaa_data.head()

Since these samples occur about every 5 minutes, we'll change the index to be the datetime. This will allow for some other functionality.

In [None]:
noaa_data.set_index('LST_DATE_LST_TIME',inplace=True)

In [None]:
noaa_data.head()

Using the `.groupby()` method and an aggregate function, we can start to see some grouped data. We're also introducing `pd.Grouper`, a helper object that describes how to group by characteristics such as time. In this case, pandas takes the data sampled at 5-minute increments, groups it by a time frequency, and applies a function to each group. Here we group by week and compute the average air temperature for each week:

In [None]:
noaa_data.loc[:,'AIR_TEMPERATURE'].groupby(pd.Grouper(freq='W')).mean()
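To see what the weekly grouping does without the real file, here is a small sketch on synthetic data. The values are invented (a constant 10.0 for the first week and 20.0 for the second) so the weekly means are easy to check by eye; `idx`, `temps`, and `weekly_mean` are illustrative names.

```python
import numpy as np
import pandas as pd

# Two full weeks of 5-minute samples, starting Monday 2018-01-01
# (288 samples per day).
idx = pd.date_range("2018-01-01", periods=14 * 288, freq="5min")

# Invented temperatures: 10.0 in the first week, 20.0 in the second.
temps = pd.Series(np.where(idx < pd.Timestamp("2018-01-08"), 10.0, 20.0), index=idx)

# Weekly bins end on Sunday by default, and each bin is labeled with that Sunday.
weekly_mean = temps.groupby(pd.Grouper(freq="W")).mean()
print(weekly_mean)
```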

Something is going on in April, May, or June. Is it really that close to absolute zero at the end of spring?

In [None]:
noaa_data[(noaa_data.index >= '2018-05-27') & (noaa_data.index < '2018-06-03')].loc[:,'AIR_TEMPERATURE']

Just looking at the list of data, we're not seeing anything out of the ordinary. Let's build a mask that selects air temperatures below absolute zero, and use `np.unique()` to get the date(s) associated with those rows.

In [None]:
np.unique(noaa_data[noaa_data['AIR_TEMPERATURE'] < -273.15]['AIR_TEMPERATURE'].index.date)
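The same mask-then-unique pattern, sketched on four made-up readings so each step is visible. The timestamps and values here are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented readings: two impossible values on the same day.
idx = pd.to_datetime(["2018-05-28 12:00", "2018-05-29 12:00",
                      "2018-05-29 12:05", "2018-05-30 12:00"])
temps = pd.Series([24.1, -9999.0, -9999.0, 25.3], index=idx)

# Boolean mask: True where the reading is below absolute zero (in Celsius).
impossible = temps[temps < -273.15]

# .index.date drops the time of day; np.unique collapses repeated dates.
bad_dates = np.unique(impossible.index.date)
print(bad_dates)
```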

Three days in 2018 had air temperatures lower than absolute zero. I think that would have made news. Let's look at a histogram to see what our distribution is for temperatures like that.

In [None]:
np.histogram(noaa_data[(noaa_data.index >= '2018-05-29') & (noaa_data.index < '2018-05-30')].loc[:,'AIR_TEMPERATURE'])

And while we're at it, let's transition to our favorite thing: indexers!

In [None]:
np.histogram(noaa_data.loc["2018-05-29":"2018-05-29","AIR_TEMPERATURE"])

Looking at these bins, we can see a really weird distribution. Most of the data is in the rightmost bin, with temperatures at or below 26.9 degrees Celsius. But there are a lot of -9999 values. We know this to be incorrect data. In fact, this is indicated in the notes of our specification document:
* C.  Missing data are indicated by the lowest possible integer for a given column format, such as -9999.0 for 7-character fields with one decimal place or -99.000 for 7-character fields with three decimal places.

We don't always have specifications for errors, so it's good to have a couple of ways to look at where some outliers might make our data messy.
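As a sketch of a couple of quick checks, the minimum, the most frequent values, and the summary statistics will often expose a sentinel value even without a specification. The series below is made up for illustration.

```python
import pandas as pd

# Invented readings containing a repeated sentinel value.
temps = pd.Series([21.4, 22.0, -9999.0, 23.1, -9999.0, 22.7])

print(temps.min())                    # an implausible minimum stands out
print(temps.value_counts().head(3))   # repeated sentinels rise to the top
print(temps.describe())               # mean and std are dragged far from the median
```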

In [None]:
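noaa_data['AIR_TEMPERATURE'] = noaa_data['AIR_TEMPERATURE'].replace(-9999,np.nan)

Here we replace the invalid data with `np.nan`, assigning the result back to the column so the change is sure to reach the DataFrame. `np.nan` is still a placeholder rather than real data, but it signals to many functions that the value should be omitted. Check out this function: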

In [None]:
# noaa_data.mean?

By default, `.mean()` skips null/nan values. Look at how the following three examples work:

In [None]:
pd.Series([1,2,3]).mean()

In [None]:
pd.Series([1,2,3,np.nan]).mean()

In [None]:
pd.Series([1,2,3,np.nan]).mean(skipna=False)

Now, let's reapply based upon our fixed data. In theory, the values should be more in accordance with our expectations. We're dropping `np.nan` from our calculated mean.

In [None]:
noaa_data.loc[:,'AIR_TEMPERATURE'].groupby(pd.Grouper(freq='W')).mean()

But be careful: now that we have `np.nan` in our data, the histogram from above might be broken.

In [None]:
np.histogram(noaa_data.loc["2018-05-29":"2018-05-29","AIR_TEMPERATURE"])

We can explicitly use the `np.nanmin()` and `np.nanmax()` functions to find the minimum and maximum for a range, ignoring NaN.

In [None]:
noaa_day = noaa_data.loc["2018-05-29":"2018-05-29","AIR_TEMPERATURE"]
np.histogram(noaa_day,range=(np.nanmin(noaa_day),np.nanmax(noaa_day)))
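Another option, sketched here on made-up values, is to drop the NaN entries before building the histogram. On recent NumPy versions, calling `np.histogram` on data containing NaN without an explicit `range` is expected to raise a `ValueError`, which the `try` block below demonstrates.

```python
import numpy as np
import pandas as pd

# Invented readings with one missing value.
day = pd.Series([20.0, 21.0, np.nan, 25.0])

try:
    np.histogram(day)
    print("histogram computed despite NaN")
except ValueError as err:
    print("histogram failed:", err)

# Dropping NaN first lets NumPy pick the bin range on its own.
counts, edges = np.histogram(day.dropna(), bins=5)
print(counts, edges)
```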

One benefit of using the datetime values as an index is that we can group by properties of those dates. Ever need to look at data grouped by hour of the day? What was the average temperature for each hour in the month of July?

In [None]:
noaa_month = noaa_data.loc["2018-07-01":"2018-07-31"]  # label slices include the end date, so stop at July 31
noaa_month.loc[:,'AIR_TEMPERATURE'].groupby(noaa_month.index.hour).mean()
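Here is a sketch of the same idea on synthetic data, where the fake "temperature" simply equals the hour of the day so the result is easy to verify. It also shows that a partial date string such as `"2018-07"` selects the whole month. The names `temps`, `july`, and `hourly_mean` are illustrative.

```python
import pandas as pd

# 5-minute samples from June 30 through August 1.
idx = pd.date_range("2018-06-30", "2018-08-01 23:55", freq="5min")

# Fake data: the temperature equals the hour of the day.
temps = pd.Series(idx.hour.to_numpy(dtype=float), index=idx)

# A partial date string selects every row in that month.
july = temps.loc["2018-07"]

# Group by the hour component of the index (0 through 23).
hourly_mean = july.groupby(july.index.hour).mean()
print(july.index.min(), july.index.max())
print(hourly_mean.head())
```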

Which day of the week has the highest average temperature for the year (so far)? (0 = Monday)

In [None]:
noaa_data['AIR_TEMPERATURE'].groupby(noaa_data.index.dayofweek).mean()

Let's take a moment to make it a little more readable:

In [None]:
from calendar import day_name
pd.DataFrame({'Day':[day_name[i] for i in range(7)],
            'Avg. Temp':noaa_data['AIR_TEMPERATURE'].groupby(noaa_data.index.dayofweek).mean()},
             columns=['Day','Avg. Temp']
).set_index('Day').T
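A possible alternative, sketched on invented data, is to group directly on `DatetimeIndex.day_name()`. Grouping by name sorts the days alphabetically, so the sketch reorders them with `calendar.day_name`. This assumes the default English locale for both sets of names.

```python
from calendar import day_name
import pandas as pd

# Two weeks of daily values starting on Monday 2018-01-01 (invented data).
idx = pd.date_range("2018-01-01", periods=14, freq="D")
temps = pd.Series(range(14), index=idx, dtype=float)

# Group by the weekday name, then restore Monday-to-Sunday order.
by_day = temps.groupby(temps.index.day_name()).mean().reindex(list(day_name))
print(by_day)
```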

Now let's look at a situation where `.sum()` might be more appropriate: precipitation.

In [None]:
noaa_data.loc[:,'PRECIPITATION'].groupby(pd.Grouper(freq='W')).sum()

We're still seeing the values affected by the invalid entries. Adding -9999 to a value potentially every 5 minutes can really throw off our analysis. We've seen how to replace a single value with `np.nan`, and that is almost certainly what we'll do here. However, if we're talking about an amount of precipitation in a 5-minute period, _any_ negative number could potentially be an invalid value. Let's investigate using ranges with start/end/step values as a way to leverage the power of replace.

In [None]:
some_numbers = [10,8,3,1,5,-5,2,-15,-4,5,-2,-1,-3,-5]
pd.Series(some_numbers).replace(range(-5,-1,2),0)

To help with visualization, let's put the results next to the original numbers.

In [None]:
old_numbers = [10,8,3,1,5,-5,2,-15,-4,5,-2,-1,-3,-5]
new_numbers = pd.Series(old_numbers).replace(range(1,10,1),0)
pd.concat([pd.Series(old_numbers),pd.Series(new_numbers)],axis=1)
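It can help to expand the range first, to see exactly which values will be replaced. A minimal sketch using the same numbers as above, with `targets` and `replaced` as illustrative names:

```python
import pandas as pd

some_numbers = [10, 8, 3, 1, 5, -5, 2, -15, -4, 5, -2, -1, -3, -5]

# range(start, stop, step) excludes stop, so this yields only -5 and -3.
targets = list(range(-5, -1, 2))
print(targets)

# Only values that appear in targets are replaced; -4, -2, -1 and -15 are kept.
replaced = pd.Series(some_numbers).replace(targets, 0)
print(replaced.tolist())
```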

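Now that we've experimented with values in `.replace(range(x,y,z),n)`, let's use that to change the negative numbers for precipitation to `np.nan`. Keep in mind that a `range` only yields whole numbers, so this covers every negative integer from -9999 through -1.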

In [None]:
noaa_data['PRECIPITATION'] = noaa_data['PRECIPITATION'].replace(range(-9999,0),np.nan)
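Because a `range` contains only integers, a fractional negative reading would slip through. As a sketch of an alternative, `Series.mask` with a comparison catches every negative value. The readings below are invented, and the -0.5 is a hypothetical value included only to show the difference.

```python
import numpy as np
import pandas as pd

# Invented readings: a sentinel and a hypothetical fractional negative.
precip = pd.Series([0.0, 0.3, -9999.0, -0.5, 1.2])

# Replacing a range of integers leaves -0.5 in place.
by_range = precip.replace(list(range(-9999, 0)), np.nan)

# Masking on a condition turns every negative value into NaN.
cleaned = precip.mask(precip < 0)

print(by_range.tolist())
print(cleaned.tolist())
```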

And just to verify that the precipitation values are amounts for each 5-minute interval, and not cumulative totals, let's look at a specific range to see how they behave. We'll leverage some more functionality with indexers.

In [None]:
noaa_data.loc["2018-06-10 22:00:00":"2018-06-10",'PRECIPITATION']

Here we see that the indexers are smart enough to combine a time with a date. This segment was deliberately picked because it suggests that the rainfall is recorded per five-minute segment rather than cumulatively. That makes the following grouping with `.sum()` more likely to be a reasonable statistic:

In [None]:
noaa_data.loc[:,'PRECIPITATION'].groupby(pd.Grouper(freq='W')).sum()
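The date-plus-time slice used earlier can be sketched on synthetic data to show where it starts and stops. The end label `"2018-06-10"` has day resolution, so the slice runs through the last sample of that day. The constant 0.1 per interval is invented.

```python
import pandas as pd

# Two days of 5-minute samples with an invented 0.1 per interval.
idx = pd.date_range("2018-06-10", "2018-06-11 23:55", freq="5min")
precip = pd.Series(0.1, index=idx)

# Start at an exact time; end anywhere within the day 2018-06-10.
evening = precip.loc["2018-06-10 22:00:00":"2018-06-10"]
print(evening.index[0], evening.index[-1], len(evening))

# For per-interval amounts, the sum is the total for the period;
# the running total (cumsum) ends at that same number.
print(evening.sum(), evening.cumsum().iloc[-1])
```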