# Time-based Data

In this lesson, we'll look at some of the ways that Python and pandas can manipulate data based on a time index. But, as with everything we've done so far, the data doesn't always start out in the form we want.

In [1]:
import pandas as pd
import numpy as np

The next two links are the data and a README file that describes the data format. To keep it close to home, the data in the first link is from Durham, NC.

https://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2018/CRNS0101-05-2018-NC_Durham_11_W.txt

Clicking on that link, you'll see that there are a bunch of columns but no headers. The columns are fixed-width and padded with spaces rather than separated by commas or another single delimiter character. Let's see what the pandas default does with this kind of text:

In [15]:
np.__version__

'1.14.3'

In [2]:
pd.read_csv(r'./CRNS0101-05-2018-NC_Durham_11_W.txt').head()

Unnamed: 0,03758 20180101 0005 20171231 1905 2 -79.09 35.97 -4.3 0.0 0 0 -3.8 C 0 22 0 0.265 4.4 1217 0 1.76 0
0,03758 20180101 0010 20171231 1910 2 -79....
1,03758 20180101 0015 20171231 1915 2 -79....
2,03758 20180101 0020 20171231 1920 2 -79....
3,03758 20180101 0025 20171231 1925 2 -79....
4,03758 20180101 0030 20171231 1930 2 -79....


That's not all that useful. Let's see what the README has to say about it.

https://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/README.txt

If you scroll down to section 5, you'll see where it describes the columns, but as plain text rather than in a machine-readable format. Can we get that into something we'd like? Of course! With Regular Expressions!

![Weather Headers](weather_headers.png "Weather Headers!")

Looking at this list, it's pretty easy to see the information we need, and we should be able to write a regular expression to extract the middle column (the column names). Let's select the listing in the browser, copy and paste it into our text editor, and work on it there, saving the result as `weather_headers.csv`.
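The same extraction could also be done in Python instead of a text editor. The following is a minimal sketch, assuming each line of the listing has the shape "field number, column name, units" separated by whitespace. The `sample` text is a hand-typed, hypothetical stand-in for a few lines of the README, not a verbatim copy.

```python
import re

# Hypothetical stand-in for a few lines of the README's column listing
# (assumed layout: field number, column name, units).
sample = """\
   1    WBANNO                         XXXXX
   2    UTC_DATE                       YYYYMMDD
   3    UTC_TIME                       HHmm
"""

# On each line, skip the leading field number and capture the next
# run of non-whitespace characters (the column name).
pattern = re.compile(r"^\s*\d+\s+(\S+)", flags=re.MULTILINE)
column_names = pattern.findall(sample)
print(column_names)
```

Writing `column_names` out one per line would give a file like the `weather_headers.csv` used in the next cell.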

In [3]:
headers = pd.read_csv(r'./weather_headers.csv',header=None,squeeze=True)

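We'll import the file we just created, but add a new keyword argument, `squeeze`. This returns the imported single column as a pandas Series rather than a pandas DataFrame, which affects the formatting later. (The `squeeze` keyword was deprecated in pandas 1.4 and removed in 2.0; on newer versions, call `.squeeze("columns")` on the result of `pd.read_csv` instead.)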

In [4]:
headers

0                  WBANNO
1                UTC_DATE
2                UTC_TIME
3                LST_DATE
4                LST_TIME
5                  CRX_VN
6               LONGITUDE
7                LATITUDE
8         AIR_TEMPERATURE
9           PRECIPITATION
10        SOLAR_RADIATION
11                SR_FLAG
12    SURFACE_TEMPERATURE
13                ST_TYPE
14                ST_FLAG
15      RELATIVE_HUMIDITY
16                RH_FLAG
17        SOIL_MOISTURE_5
18     SOIL_TEMPERATURE_5
19                WETNESS
20               WET_FLAG
21               WIND_1_5
22              WIND_FLAG
Name: 0, dtype: object

In [5]:
# pd.read_csv?

Now, we'll import the data from the text file doing the following:
* split on another regular expression, specifically `\s+` meaning one or more whitespace characters.
* treat the first line as data.
* use the 'headers' from above as the headers of the columns
* combine the local date and time into a single datetime field

In [6]:
noaa_data = pd.read_csv(r'./CRNS0101-05-2018-NC_Durham_11_W.txt',
                        delimiter=r'\s+', header=None, names=headers.values,
                        parse_dates=[['LST_DATE', 'LST_TIME']])
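To see the individual pieces of this call in isolation, here is a small sketch that reads a few made-up rows from an in-memory string. The rows and the reduced column list are assumptions for illustration only. It also builds the timestamp explicitly with `pd.to_datetime`, which is an alternative to the nested-list form of `parse_dates` (recent pandas releases deprecate combining columns inside `read_csv`).

```python
import io
import pandas as pd

# Three made-up rows with only four of the columns, for illustration.
raw = io.StringIO(
    "03758 20171231 1905 -4.3\n"
    "03758 20171231 1910 -4.3\n"
    "03758 20171231 1915 -4.2\n"
)
names = ["WBANNO", "LST_DATE", "LST_TIME", "AIR_TEMPERATURE"]

sample = pd.read_csv(
    raw,
    sep=r"\s+",      # one or more whitespace characters
    header=None,     # the first line is data, not a header
    names=names,
    # Read identifiers, dates, and times as text so leading zeros survive.
    dtype={"WBANNO": str, "LST_DATE": str, "LST_TIME": str},
)

# Build the timestamp from the concatenated date and time strings.
sample["LST_DATETIME"] = pd.to_datetime(
    sample["LST_DATE"] + sample["LST_TIME"], format="%Y%m%d%H%M"
)
print(sample[["LST_DATETIME", "AIR_TEMPERATURE"]])
```

Reading the date and time as strings matters here: a time such as `0005` would otherwise be parsed as the integer 5.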

Previewing the data, this looks useful.

In [7]:
noaa_data.head()

,LST_DATE_LST_TIME,WBANNO,UTC_DATE,UTC_TIME,CRX_VN,LONGITUDE,LATITUDE,AIR_TEMPERATURE,PRECIPITATION,SOLAR_RADIATION,...,ST_TYPE,ST_FLAG,RELATIVE_HUMIDITY,RH_FLAG,SOIL_MOISTURE_5,SOIL_TEMPERATURE_5,WETNESS,WET_FLAG,WIND_1_5,WIND_FLAG
0,2017-12-31 19:05:00,3758,20180101,5,2,-79.09,35.97,-4.3,0.0,0,...,C,0,22,0,0.265,4.4,1217,0,1.76,0
1,2017-12-31 19:10:00,3758,20180101,10,2,-79.09,35.97,-4.3,0.0,0,...,C,0,22,0,0.265,4.4,1217,0,1.51,0
2,2017-12-31 19:15:00,3758,20180101,15,2,-79.09,35.97,-4.3,0.0,0,...,C,0,22,0,0.265,4.3,1211,0,1.56,0
3,2017-12-31 19:20:00,3758,20180101,20,2,-79.09,35.97,-4.3,0.0,0,...,C,0,22,0,0.265,4.3,1217,0,2.07,0
4,2017-12-31 19:25:00,3758,20180101,25,2,-79.09,35.97,-4.3,0.0,0,...,C,0,22,0,0.265,4.3,1217,0,1.86,0


Since these samples occur about every 5 minutes, we'll change the index to be the datetime. This lets us slice the data by date and group it by time frequency.

In [8]:
noaa_data.set_index('LST_DATE_LST_TIME',inplace=True)

In [9]:
noaa_data.head()

LST_DATE_LST_TIME,WBANNO,UTC_DATE,UTC_TIME,CRX_VN,LONGITUDE,LATITUDE,AIR_TEMPERATURE,PRECIPITATION,SOLAR_RADIATION,SR_FLAG,...,ST_TYPE,ST_FLAG,RELATIVE_HUMIDITY,RH_FLAG,SOIL_MOISTURE_5,SOIL_TEMPERATURE_5,WETNESS,WET_FLAG,WIND_1_5,WIND_FLAG
2017-12-31 19:05:00,3758,20180101,5,2,-79.09,35.97,-4.3,0.0,0,0,...,C,0,22,0,0.265,4.4,1217,0,1.76,0
2017-12-31 19:10:00,3758,20180101,10,2,-79.09,35.97,-4.3,0.0,0,0,...,C,0,22,0,0.265,4.4,1217,0,1.51,0
2017-12-31 19:15:00,3758,20180101,15,2,-79.09,35.97,-4.3,0.0,0,0,...,C,0,22,0,0.265,4.3,1211,0,1.56,0
2017-12-31 19:20:00,3758,20180101,20,2,-79.09,35.97,-4.3,0.0,0,0,...,C,0,22,0,0.265,4.3,1217,0,2.07,0
2017-12-31 19:25:00,3758,20180101,25,2,-79.09,35.97,-4.3,0.0,0,0,...,C,0,22,0,0.265,4.3,1217,0,1.86,0


Using the `.groupby()` method and an aggregate function, we can start to see some grouped data. We're also introducing `pd.Grouper`, a class that describes how to group, here by a time frequency on the index. In this case, pandas takes the readings, spaced 5 minutes apart, groups them by the given frequency, and then applies a function to each group. Here we're grouping by week (`freq='W'`, weeks ending on Sunday) and getting the average air temperature for each week:

In [10]:
noaa_data.loc[:,'AIR_TEMPERATURE'].groupby(pd.Grouper(freq='W')).mean()

LST_DATE_LST_TIME
2017-12-31     -4.528814
2018-01-07     -7.237847
2018-01-14      6.990625
2018-01-21      0.747569
2018-01-28      8.395089
2018-02-04      2.817708
2018-02-11      8.598810
2018-02-18     11.144494
2018-02-25     16.084524
2018-03-04      8.591121
2018-03-11      4.532242
2018-03-18      6.572669
2018-03-25      5.911409
2018-04-01     12.809970
2018-04-08     12.107837
2018-04-15     14.936558
2018-04-22     12.446776
2018-04-29    -44.525942
2018-05-06     19.035268
2018-05-13     20.755060
2018-05-20     22.814236
2018-05-27     22.898760
2018-06-03   -214.974206
2018-06-10     22.816915
2018-06-17     23.292510
2018-06-24     26.877282
2018-07-01     24.928472
2018-07-08     24.993353
2018-07-15     24.147272
2018-07-22     24.144544
2018-07-29     24.514484
2018-08-05     19.234970
2018-08-12     26.266321
Freq: W-SUN, Name: AIR_TEMPERATURE, dtype: float64

Something is going on in late April and in the week ending June 3. A weekly average of -214 degrees is not far from absolute zero; is it really that cold at the end of spring?

In [11]:
noaa_data[(noaa_data.index >= '2018-05-27') & (noaa_data.index < '2018-06-03')].loc[:,'AIR_TEMPERATURE']

LST_DATE_LST_TIME
2018-05-27 00:00:00    21.6
2018-05-27 00:05:00    21.5
2018-05-27 00:10:00    21.4
2018-05-27 00:15:00    21.5
2018-05-27 00:20:00    21.5
2018-05-27 00:25:00    21.5
2018-05-27 00:30:00    21.5
2018-05-27 00:35:00    21.4
2018-05-27 00:40:00    21.5
2018-05-27 00:45:00    21.6
2018-05-27 00:50:00    21.6
2018-05-27 00:55:00    21.6
2018-05-27 01:00:00    21.6
2018-05-27 01:05:00    21.6
2018-05-27 01:10:00    21.7
2018-05-27 01:15:00    21.7
2018-05-27 01:20:00    21.6
2018-05-27 01:25:00    21.6
2018-05-27 01:30:00    21.6
2018-05-27 01:35:00    21.5
2018-05-27 01:40:00    21.5
2018-05-27 01:45:00    21.4
2018-05-27 01:50:00    21.4
2018-05-27 01:55:00    21.4
2018-05-27 02:00:00    21.3
2018-05-27 02:05:00    21.3
2018-05-27 02:10:00    21.4
2018-05-27 02:15:00    21.3
2018-05-27 02:20:00    21.3
2018-05-27 02:25:00    21.2
                       ... 
2018-06-02 21:30:00    20.9
2018-06-02 21:35:00    21.0
2018-06-02 21:40:00    20.9
2018-06-02 21:45:00    20.8
20

Just looking at the list of data, we're not seeing anything out of the ordinary. Let's mask the data to keep only air temperatures below absolute zero (-273.15 degrees Celsius), and use `np.unique()` to get the date(s) on which they occur.

In [12]:
np.unique(noaa_data[noaa_data['AIR_TEMPERATURE'] < -273.15]['AIR_TEMPERATURE'].index.date)

array([datetime.date(2018, 4, 26), datetime.date(2018, 5, 29),
       datetime.date(2018, 8, 5)], dtype=object)

Three days in 2018 had air temperatures lower than absolute zero. I think that would have made news. Let's look at a histogram for one of those days to see how the temperatures are distributed.

In [13]:
np.histogram(noaa_data[(noaa_data.index >= '2018-05-29') & (noaa_data.index < '2018-05-30')].loc[:,'AIR_TEMPERATURE'])

(array([ 48,   0,   0,   0,   0,   0,   0,   0,   0, 240]),
 array([-9999.  , -8996.41, -7993.82, -6991.23, -5988.64, -4986.05,
        -3983.46, -2980.87, -1978.28,  -975.69,    26.9 ]))

And while we're at it, let's transition to our favorite thing: indexers!

In [14]:
np.histogram(noaa_data.loc["2018-05-29":"2018-05-29","AIR_TEMPERATURE"])

(array([ 48,   0,   0,   0,   0,   0,   0,   0,   0, 240]),
 array([-9999.  , -8996.41, -7993.82, -6991.23, -5988.64, -4986.05,
        -3983.46, -2980.87, -1978.28,  -975.69,    26.9 ]))

Looking at these bins, we can see a really weird distribution. Most of the data (240 readings) is in the rightmost bin, whose upper edge is 26.9 degrees Celsius. But 48 readings sit in the leftmost bin, which starts at -9999. We know this to be incorrect data. In fact, this is indicated in the notes of our specification document:
* C.  Missing data are indicated by the lowest possible integer for a given column format, such as -9999.0 for 7-character fields with one decimal place or -99.000 for 7-character fields with three decimal places.

We don't always have specifications for errors, so it's good to have a couple of ways to look at where some outliers might make our data messy.

In [16]:
noaa_data['AIR_TEMPERATURE'] = noaa_data['AIR_TEMPERATURE'].replace(-9999, np.nan)

Here we replace the invalid data with `np.nan`. Even though `np.nan` isn't a real measurement either, it tells functions that the value is missing and should be omitted. Check out this function:

In [17]:
# noaa_data.mean?

By default, `.mean()` skips null/nan values. Look at how the following three examples work:

In [None]:
pd.Series([1,2,3]).mean()

In [None]:
pd.Series([1,2,3,np.nan]).mean()

In [None]:
pd.Series([1,2,3,np.nan]).mean(skipna=False)
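To tie this back to the sentinel values, here is a small sketch on a handful of made-up readings (the numbers are assumptions, chosen only to resemble the data). It shows how a few `-9999` entries drag the mean down, and how the mean recovers once they are replaced with `np.nan`.

```python
import numpy as np
import pandas as pd

# Made-up readings; -9999.0 marks a missing value, as in the README note.
temps = pd.Series([21.5, -9999.0, 21.7, -9999.0, 21.6])

# With the sentinels left in, the mean is pulled far below any real temperature.
print(temps.mean())

# Replace the sentinel with NaN; .replace() returns a new Series.
cleaned = temps.replace(-9999.0, np.nan)

# NaN is skipped by default, so only the three real readings are averaged.
print(cleaned.mean())

# Count how many readings were missing.
print(cleaned.isna().sum())
```

`pd.read_csv` also accepts an `na_values` argument, which could mark such sentinels as missing at load time rather than afterwards.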

Now, let's rerun the weekly average on our fixed data. Because `.mean()` drops `np.nan` values, the results should be more in line with our expectations.

In [18]:
noaa_data.loc[:,'AIR_TEMPERATURE'].groupby(pd.Grouper(freq='W')).mean()

LST_DATE_LST_TIME
2017-12-31    -4.528814
2018-01-07    -7.237847
2018-01-14     6.990625
2018-01-21     0.747569
2018-01-28     8.395089
2018-02-04     2.817708
2018-02-11     8.598810
2018-02-18    11.144494
2018-02-25    16.084524
2018-03-04     8.591121
2018-03-11     4.532242
2018-03-18     6.572669
2018-03-25     5.911409
2018-04-01    12.809970
2018-04-08    12.107837
2018-04-15    14.936558
2018-04-22    12.446776
2018-04-29    15.081687
2018-05-06    19.035268
2018-05-13    20.755060
2018-05-20    22.814236
2018-05-27    22.898760
2018-06-03    23.660569
2018-06-10    22.816915
2018-06-17    23.292510
2018-06-24    26.877282
2018-07-01    24.928472
2018-07-08    24.993353
2018-07-15    24.147272
2018-07-22    24.144544
2018-07-29    24.514484
2018-08-05    24.206799
2018-08-12    26.266321
Freq: W-SUN, Name: AIR_TEMPERATURE, dtype: float64
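The weekly grouping used above can also be written with `.resample()`. The sketch below compares the two spellings on synthetic data (an assumption for illustration: two full weeks of 5-minute readings whose values are just a running count), so the weekly means are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute readings for two full weeks,
# Monday 2018-01-01 through Sunday 2018-01-14.
index = pd.date_range("2018-01-01", "2018-01-14 23:55", freq="5min")
temps = pd.Series(np.arange(len(index), dtype=float), index=index, name="TEMP")

# Two ways to ask for weekly means (weeks end on Sunday by default).
weekly_grouper = temps.groupby(pd.Grouper(freq="W")).mean()
weekly_resample = temps.resample("W").mean()

print(weekly_grouper)
print(np.allclose(weekly_grouper.values, weekly_resample.values))
```

Each bin is labeled with the Sunday that closes the week, which matches the `W-SUN` frequency shown in the output above.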

But be careful: now that we have `np.nan` in our data, the histogram call from above breaks.

In [19]:
np.histogram(noaa_data.loc["2018-05-29":"2018-05-29","AIR_TEMPERATURE"])

  return umr_minimum(a, axis, None, out, keepdims)
  return umr_maximum(a, axis, None, out, keepdims)


ValueError: range parameter must be finite.

Check your version of NumPy:

In [20]:
np.__version__

'1.14.3'

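The NumPy 1.15 release notes state that "histogram will accept NaN values when explicit bins are given." The call above relies on the default bins, which are computed from the data's minimum and maximum, so the simplest fix on any version is to drop the NaN values before building the histogram.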
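A minimal sketch of that workaround, using a short made-up Series rather than the NOAA data: remove the missing values first, then bin what is left.

```python
import numpy as np
import pandas as pd

# Made-up readings with two missing values.
temps = pd.Series([21.5, np.nan, 21.7, 22.4, np.nan, 23.0])

# Option 1: let pandas drop the missing values before binning.
counts, edges = np.histogram(temps.dropna(), bins=3)
print(counts, edges)

# Option 2: mask the NaN entries on the underlying NumPy array.
values = temps.values
counts_masked, _ = np.histogram(values[~np.isnan(values)], bins=3)
print(counts_masked)
```

Both options bin the same four valid readings; which one reads better depends on whether the data is already a pandas object or a plain array.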

One benefit of using the datetime values as an index is that we can group by properties of those dates. Ever need to look at data grouped by hour of the day? What was the average temperature for each hour of the day in July? (Label slices with `.loc` include both endpoints, so the slice below also picks up August 1.)

In [28]:
noaa_month = noaa_data.loc["2018-07-01":"2018-08-01"]
noaa_month.loc[:,'AIR_TEMPERATURE'].groupby(noaa_month.index.hour).mean()

LST_DATE_LST_TIME
0     21.434115
1     21.044531
2     20.735938
3     20.448437
4     20.299479
5     20.227083
6     21.211198
7     22.832292
8     24.351042
9     25.720573
10    26.765625
11    27.762500
12    28.766927
13    29.477344
14    29.496094
15    29.222396
16    29.065104
17    28.552865
18    26.458073
19    24.436979
20    23.342708
21    22.694010
22    22.391406
23    21.930469
Name: AIR_TEMPERATURE, dtype: float64
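The date-string slices used with `.loc` above include both endpoints, which is easy to overlook when selecting a month. The sketch below uses a synthetic 5-minute index (an assumption for illustration) to compare three ways of selecting "July".

```python
import pandas as pd

# Synthetic 5-minute index spanning late June to early August 2018.
index = pd.date_range("2018-06-30", "2018-08-02 23:55", freq="5min")
readings = pd.Series(range(len(index)), index=index)

# A partial date string selects every row in that month.
july_partial = readings.loc["2018-07"]

# A label slice is inclusive on both ends, so this selects the same rows.
july_slice = readings.loc["2018-07-01":"2018-07-31"]

# Ending the slice on August 1 also includes all of August 1.
through_aug_1 = readings.loc["2018-07-01":"2018-08-01"]

print(len(july_partial), len(july_slice), len(through_aug_1))
print(through_aug_1.index.max())
```

With 288 five-minute readings per day, the last selection holds one more day of data than the first two.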

Which day of the week has been the hottest on average so far this year? (0 = Monday)

In [53]:
noaa_data['AIR_TEMPERATURE'].groupby(noaa_data.index.dayofweek).mean()

LST_DATE_LST_TIME
0    14.295812
1    13.928773
2    15.074754
3    15.834376
4    15.986246
5    15.256877
6    15.738237
Name: AIR_TEMPERATURE, dtype: float64
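The integer labels can be swapped for day names to make the result easier to read. This is a sketch on synthetic hourly data (an assumption for illustration: each value is simply its day-of-week number, so the group means are known in advance); on the real data the same steps would apply to `noaa_data['AIR_TEMPERATURE']`.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for two full weeks starting Monday 2018-01-01.
index = pd.date_range("2018-01-01", periods=14 * 24, freq="60min")
temps = pd.Series(np.asarray(index.dayofweek, dtype=float), index=index)

# Group by day of week (0 = Monday) and average.
by_day = temps.groupby(temps.index.dayofweek).mean()

# Relabel 0..6 with day names, keeping Monday-first order.
day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
by_day.index = [day_names[i] for i in by_day.index]

print(by_day)
print(by_day.idxmax())  # label of the largest mean
```

Using `.idxmax()` answers the "which day" question directly instead of scanning the table by eye.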

In [None]:
noaa_data.loc[:,'PRECIPITATION'].groupby(pd.Grouper(freq='W')).sum()