Weather Data

In [None]:
import numpy as np
import pandas as pd

In [None]:
#Defining variables. hourlyColumns may be altered later, but this is what we are using for now

In [None]:
hourlyColumns = ['DATE',
'REPORTTPYE',
'HOURLYSKYCONDITIONS',
'HOURLYVISIBILITY',
'HOURLYPRSENTWEATHERTYPE',
'HOURLYWETBULBTEMPF',
'HOURLYDRYBULBTEMPF',
'HOURLYDewPointTempF',
'HOURLYRelativeHumidity',
'HOURLYWindSpeed',
'HOURLYWindDirection',
'HOURLYWindGustSpeed',
'HOURLYStationPressure',
'HOURLYPressureTendency',
'HOURLYPressureChange',
'HOURLYSeaLevelPressure',
'HOURLYPrecip',
'HOURLYAltimeterSetting']

In [None]:
#Defining functions - all together so we can see them and know what we have to work with
# without scrolling through entire program

In [None]:
def getAllWeatherData():
    return pd.read_csv(r'../data/1052640.csv',low_memory=False)

In [None]:
def getHourlyWeatherData():
    return pd.read_csv(r'../data/1052640.csv',usecols = hourlyColumns, low_memory = False)


In [None]:
def displayWeatherData(array):
    # print() with parentheses works in both Python 2 and Python 3
    print(array.columns.values)


In [None]:
# Declaring variables from the functions - don't need to know exactly what is in them to use them
gsoDataAll = getAllWeatherData()
gsoDataHours = getHourlyWeatherData()

In [None]:
#How to use the display method
displayWeatherData(gsoDataAll)
displayWeatherData(gsoDataHours)

This is just a smaller subset of the columns. Daily and monthly rollups were ignored, and Fahrenheit temperatures are used instead of Celsius.

In [None]:

gsoData = getHourlyWeatherData()

Verifying the columns.

In [None]:

gsoData.info()

The spelling here is frustrating: the source file names the column `REPORTTPYE`, so we rename it to `REPORTTYPE`.

In [None]:

gsoData.rename(columns = {'REPORTTPYE':'REPORTTYPE'}, inplace=True)

These seem to be start of day values:

In [None]:

gsoData[gsoData.REPORTTYPE == 'SOD']
    

Dropping **S**tart **O**f **D**ay

In [None]:
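# .copy() gives an independent DataFrame, so the in-place changes below don't act on a slice of gsoData
gsoDataHourly = gsoData[gsoData.REPORTTYPE != 'SOD'].copy()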

In [None]:
gsoDataHourly.REPORTTYPE.unique()

In [None]:
# Parse the whole DATE column at once and use it as the index
gsoDataHourly.set_index(pd.to_datetime(gsoDataHourly['DATE']), inplace=True)

In [None]:
gsoDataHourly = gsoDataHourly.drop('DATE',axis=1)

In [None]:
gsoDataHourly.info()

### Issues with invalid data in a float column:
Here's some testing of how invalid values in `HOURLYDRYBULBTEMPF` behave. First, let's look at the column itself.

In [57]:
gsoDataHourly.HOURLYDRYBULBTEMPF

DATE
2017-01-01 00:54:00    44
2017-01-01 01:54:00    44
2017-01-01 02:54:00    44
2017-01-01 03:54:00    45
2017-01-01 04:54:00    41
2017-01-01 05:54:00    41
2017-01-01 06:52:00    41
2017-01-01 06:54:00    41
2017-01-01 07:13:00    41
2017-01-01 07:52:00    41
2017-01-01 07:54:00    41
2017-01-01 08:54:00    42
2017-01-01 09:54:00    44
2017-01-01 10:34:00    45
2017-01-01 10:54:00    46
2017-01-01 11:54:00    48
2017-01-01 12:35:00    48
2017-01-01 12:54:00    49
2017-01-01 13:31:00    48
2017-01-01 13:34:00    48
2017-01-01 13:54:00    48
2017-01-01 13:58:00    48
2017-01-01 14:23:00    47
2017-01-01 14:52:00    48
2017-01-01 14:54:00    48
2017-01-01 15:04:00    48
2017-01-01 15:45:00    48
2017-01-01 15:54:00    48
2017-01-01 16:35:00    48
2017-01-01 16:54:00    48
                       ..
2017-08-21 23:54:00    73
2017-08-22 00:54:00    72
2017-08-22 01:00:00    72
2017-08-22 01:54:00    71
2017-08-22 02:54:00    72
2017-08-22 03:54:00    70
2017-08-22 04:00:00    70
2017-08

It appears that there are 8677 rows, but the dtype is `object`, so the values are stored as strings. We can try converting them to numeric.

In [58]:
gsoDataHourly.HOURLYDRYBULBTEMPF.apply(pd.to_numeric)  # raises an error, because some of the values are not numbers

ValueError: Unable to parse string "65s" at position 0

Using `apply(pd.to_numeric,errors='coerce')` converts the non-numeric data to NaN.

In [59]:
dryBulbAsFloat = gsoDataHourly.HOURLYDRYBULBTEMPF.apply(pd.to_numeric,errors='coerce')
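To see what `errors='coerce'` does in isolation, here's a minimal sketch on a made-up Series (the four strings are illustrative, not read from the CSV, and the variable names are mine). It calls `pd.to_numeric` on the whole Series rather than through `.apply`, and then uses the resulting NaNs as a mask to recover the strings that failed to parse.

```python
import pandas as pd

# Made-up sample: two clean readings and two with a trailing 's'
raw = pd.Series(['73', '65s', '66s', '68'])

# Anything that can't be parsed as a number becomes NaN
coerced = pd.to_numeric(raw, errors='coerce')

# True wherever the conversion failed
mask = coerced.isnull()

# The original strings behind those NaNs
bad = raw[mask]

print(coerced)
print(bad)
```

The same mask idea is what the next few cells apply to the real column.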

These are the values that were coerced to NaN.

In [60]:
dryBulbAsFloat[dryBulbAsFloat.isnull()]

DATE
2017-05-19 15:37:00   NaN
2017-05-19 15:54:00   NaN
2017-05-19 16:00:00   NaN
2017-05-19 16:30:00   NaN
2017-07-23 17:01:00   NaN
2017-07-23 17:45:00   NaN
2017-07-23 17:54:00   NaN
Name: HOURLYDRYBULBTEMPF, dtype: float64

Using `dryBulbAsFloat.isnull()` as a boolean mask on the original `gsoDataHourly` column, we can see which raw values were coerced to NaN:

In [61]:
gsoDataHourly.HOURLYDRYBULBTEMPF[dryBulbAsFloat.isnull()]

DATE
2017-05-19 15:37:00    65s
2017-05-19 15:54:00    66s
2017-05-19 16:00:00    66s
2017-05-19 16:30:00    66s
2017-07-23 17:01:00    71s
2017-07-23 17:45:00    71s
2017-07-23 17:54:00    71s
Name: HOURLYDRYBULBTEMPF, dtype: object

It appears that all of the invalid data in the output above falls on two days. Here is a window on 5/19/17 that brackets all of the invalid readings for that day, for a more complete picture:

In [62]:
gsoDataHourly[(gsoDataHourly.index > pd.to_datetime("2017-05-19 15:00:00")) &\
              (gsoDataHourly.index < pd.to_datetime("2017-05-19 17:00:00"))].loc[:,'HOURLYDRYBULBTEMPF']

DATE
2017-05-19 15:23:00     73
2017-05-19 15:37:00    65s
2017-05-19 15:54:00    66s
2017-05-19 16:00:00    66s
2017-05-19 16:30:00    66s
2017-05-19 16:54:00     68
Name: HOURLYDRYBULBTEMPF, dtype: object

In this case the four values with an 's' in them would be converted to NaN. Using `.ffill()` afterward would set them to 73, the last valid reading before them. If the true temperature is basically just the numeric part of each string, then forward-filling introduces a 7 or 8 degree error.
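To put numbers on that, here's a minimal sketch comparing the two options on the six 5/19/17 readings shown above. It assumes the trailing `s` is only a flag attached to an otherwise valid reading, which the data alone doesn't confirm. The regex and the variable names are mine, not part of the notebook.

```python
import pandas as pd

# The six readings from the 5/19/17 window above
idx = pd.to_datetime([
    '2017-05-19 15:23:00', '2017-05-19 15:37:00', '2017-05-19 15:54:00',
    '2017-05-19 16:00:00', '2017-05-19 16:30:00', '2017-05-19 16:54:00',
])
raw = pd.Series(['73', '65s', '66s', '66s', '66s', '68'], index=idx)

# Option 1: coerce invalid strings to NaN, then carry the last valid value forward
filled = pd.to_numeric(raw, errors='coerce').ffill()

# Option 2 (assumption: the suffix is only a flag): keep the leading number
leading_number = raw.str.extract(r'^\s*(-?\d+(?:\.\d+)?)', expand=False)
stripped = pd.to_numeric(leading_number, errors='coerce')

# Side-by-side view, with the gap between the two approaches
comparison = pd.DataFrame({'raw': raw, 'ffill': filled, 'stripped': stripped})
comparison['difference'] = comparison['ffill'] - comparison['stripped']
print(comparison)
```

Whether stripping the suffix is appropriate depends on what the flag means in the source data, so treat this as a way to compare the options rather than a recommendation.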

Doing the same for 7/23/17:

In [63]:
gsoDataHourly[(gsoDataHourly.index > pd.to_datetime("2017-07-23 16:00:00")) &\
              (gsoDataHourly.index < pd.to_datetime("2017-07-23 19:00:00"))].loc[:,'HOURLYDRYBULBTEMPF']

DATE
2017-07-23 16:30:00     74
2017-07-23 16:43:00     73
2017-07-23 16:48:00     72
2017-07-23 16:54:00     72
2017-07-23 17:01:00    71s
2017-07-23 17:45:00    71s
2017-07-23 17:54:00    71s
2017-07-23 18:54:00     73
Name: HOURLYDRYBULBTEMPF, dtype: object

Here, `.ffill()` would be perhaps less worrisome. Converting a 71s to a 72 would be less damaging to the data, but it's still changing data nonetheless.
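As a side note, the two range selections above (5/19 and 7/23) could also be written with label slicing on the `DatetimeIndex`. This is a sketch on a handful of rows where the first and last readings are made up to sit outside the window, and it assumes the index is sorted by time. One difference to keep in mind: label slices include both endpoints, while the boolean masks above use strict inequalities.

```python
import pandas as pd

# Small sample; the 14:54 and 17:54 rows are made up to fall outside the window
idx = pd.to_datetime([
    '2017-05-19 14:54:00',
    '2017-05-19 15:23:00',
    '2017-05-19 15:37:00',
    '2017-05-19 16:54:00',
    '2017-05-19 17:54:00',
])
temps = pd.Series(['74', '73', '65s', '68', '70'], index=idx,
                  name='HOURLYDRYBULBTEMPF')

# Boolean mask with strict inequalities, as in the cells above
by_mask = temps[(temps.index > pd.to_datetime('2017-05-19 15:00:00')) &
                (temps.index < pd.to_datetime('2017-05-19 17:00:00'))]

# Label slicing on a sorted DatetimeIndex; both endpoints are inclusive
by_slice = temps.loc['2017-05-19 15:00:00':'2017-05-19 17:00:00']

print(by_slice)
```

On this sample the two selections match, because no reading falls exactly on either boundary.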

ehhh... just for the heck of it...

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
dateTime = gsoDataHourly.index.values
tempWetBulbInF = gsoDataHourly.loc[:,'HOURLYWETBULBTEMPF'].values
tempDryBulbInF = pd.to_numeric(gsoDataHourly.loc[:,'HOURLYDRYBULBTEMPF'].values,errors='coerce')



temp_chart = plt.figure(figsize=(16,8))
temp1 = plt.plot(dateTime, tempWetBulbInF, label= 'Wet Bulb (F)')
temp2 = plt.plot(dateTime, tempDryBulbInF, label= 'Dry Bulb (F)')

plt.title('HourlyBulbTemp in degrees F')

plt.ylabel('Temp')
plt.xlabel('time')
plt.legend(loc=0);