# Investigation of NOAA's wind data
[The National Oceanic and Atmospheric Administration](https://www.noaa.gov) maintains a series of weather stations called [Automated Surface Observation Systems](http://www.hurricanescience.org/science/observation/landbased/automatedsurfaceobssystems/) (ASOS). They offer one-minute and five-minute interval data at these FTP sites:
* ftp://ftp.ncdc.noaa.gov/pub/data/asos-onemin/  
* ftp://ftp.ncdc.noaa.gov/pub/data/asos-fivemin/  

The structure of the five-minute interval data is [explained in this pdf](ftp://ftp.ncdc.noaa.gov/pub/data/documentlibrary/tddoc/td6401b.pdf) from NOAA. 


### Libraries and installs

In [None]:
!python -m pip install --upgrade pip

In [None]:
!pip install gmplot

In [24]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt 
import json
import datetime
import gmplot 
import re

### Get sample data

Grab data from one station for one month. 

In [5]:
filepath = "ftp://ftp.ncdc.noaa.gov/pub/data/asos-fivemin/6401-2019/64010K0J4201908.dat"
lines = [] # an array of each read line
for line in pd.read_csv(filepath_or_buffer=filepath , encoding='utf-8', header=None, chunksize=1):
    lines.append(line.iloc[0,0])

In [39]:
lines[0]

'63870K0J4 0J420190801000011108/01/19 00:00:31  5-MIN K0J4 010500Z AUTO 00000KT 10SM FEW070 23/20 A3006 190 81 1200 000/00 RMK AO2 T02330200'

In [138]:
# split lines and data chunks
data = [] # an array of arrays, inner arrays are all data for one record, outer array is all records
for line in lines:

    # reset any variables if needed
    record = [] 
    Wind_Data = False 
    Variable_Winds = False
    Gusts = False
    Wind_Direction = ''
    Wind_Speed = ''
    Gust_Speed = ''
    Variable_Wind_Info = ''

    line = line.split() # take string of one record's data and split into space separated chunks
    WBAN_Number = line[0][0:5] # The WBAN (Weather Bureau, Army, Navy) number is a unique 5-digit number
    Call_Sign = line[0][5:] # The call sign is a location identifier, three or four characters in length 
    suffix = line[1][-2:] # grab the last two digits that are the year (i.e. 19 for 2019)
    Year = '20'+suffix # in YYYY format
    CallSign_Date = re.split(Year, line[1])
    Call_Sign2 = CallSign_Date[0] # this seems to be the same as Call_Sign but without initial letter
    Date = CallSign_Date[1]
    Month = Date[0:2] # in MM format
    Day = Date[2:4] # in DD format
    Hour = Date[4:6] # in HH format
    Minute = Date[6:8] # Observations are recorded on whole five-minute increments (i.e. 00,05,10,...,50,55)
    Mystery_Prefix = Date[8:11] # I'm not sure what this is yet
    Date = Date[11:] # MM/DD/YY format
    Timestamp = line[2] # in HH:MM:SS format
    Interval = line[3] # should be 5-MIN as opposed to 1-MIN
    Call_Sign3 = line[4] # for some reason, a THIRD output of the call sign. random.
    Zulu_Time = line[5] # Zulu Time, or military time, or UTC
    Report_Modifier = line[6] # AUTO for fully automated report, COR for correction to a previously disseminated report
    
    # after this point, data could be missing/optional
    try:
        Next_Data = line[7]
        if "KT" in Next_Data:
            Wind_Data = True
            Wind_Direction = Next_Data[0:3] # in tens of degrees from true north
            if Wind_Direction == 'VRB':
                Variable_Winds = True
            Wind_Speed = Next_Data[3:5] # in whole knots (two digits)
            if Next_Data[5] == 'G':
                Gusts = True
                Gust_Speed = Next_Data[6:8] # speed in whole knots (two digits)
        else:
            Wind_Data = False
    except:
        print("OUT OF DATA AT FIELD 7")
        print(line)
        
    try:
        Next_Data = line[8]
        if Wind_Data:
            if (re.fullmatch(r'[0-9][0-9][0-9]V[0-9][0-9][0-9]', Next_Data)): #e.g. 180V240 = wind direction varies from 180 to 240 degrees
                Variable_Wind_Info = Next_Data
                Variable_Winds = True
    except:
        print("OUT OF DATA AT FIELD 8")
        print(line)
    
    #Sea_Level_Pressure = line[13] # given in tenths of hectopascals (millibars). The last digits are recorded (125 means 1012.5)
    #Station_Type = line[18]
    Num_Fields = len(line)
    record = [WBAN_Number, Call_Sign, Call_Sign2, Year, Month, Day, Hour, Minute, Mystery_Prefix, Date, Timestamp, Interval, Call_Sign3, Zulu_Time, Report_Modifier, Wind_Data, Wind_Direction, Wind_Speed, Gusts, Gust_Speed, Variable_Winds, Variable_Wind_Info, Num_Fields]
    col_names = ["WBAN_Number", "Call_Sign", "Call_Sign2", "Year", "Month", "Day", "Hour", "Minute", "Mystery_Prefix", "Date", "Timestamp", "Interval", "Call_Sign3", "Zulu_Time", "Report_Modifier", "Wind_Data", "Wind_Direction", "Wind_Speed", "Gusts", "Gust_Speed", "Variable_Winds", "Variable_Wind_Info", "Num_Fields"]
    data.append(record)
sample_df = pd.DataFrame(data, columns = col_names)

OUT OF DATA AT SECOND FIELD
['63870K0J4', '0J420190821083504908/21/19', '08:35:31', '5-MIN', 'K0J4', '211335Z', 'AUTO', 'RVRNO']


### Initial look at the data


In [141]:
sample_df.head()

Unnamed: 0,WBAN_Number,Call_Sign,Call_Sign2,Year,Month,Day,Hour,Minute,Mystery_Prefix,Date,...,Zulu_Time,Report_Modifier,Wind_Data,Wind_Direction,Wind_Speed,Gusts,Gust_Speed,Variable_Winds,Variable_Wind_Info,Num_Fields
0,63870,K0J4,0J4,2019,8,1,0,0,111,08/01/19,...,010500Z,AUTO,True,0,0,False,,False,,19
1,63870,K0J4,0J4,2019,8,1,0,5,111,08/01/19,...,010505Z,AUTO,True,0,0,False,,False,,19
2,63870,K0J4,0J4,2019,8,1,0,10,111,08/01/19,...,010510Z,AUTO,True,0,0,False,,False,,19
3,63870,K0J4,0J4,2019,8,1,0,15,111,08/01/19,...,010515Z,AUTO,True,0,0,False,,False,,19
4,63870,K0J4,0J4,2019,8,1,0,20,111,08/01/19,...,010520Z,AUTO,True,0,0,False,,False,,19


In [62]:
sample_df.describe()

Unnamed: 0,Num_Fields
count,6154.0
mean,20.156484
std,1.501627
min,8.0
25%,19.0
50%,20.0
75%,20.0
max,31.0


In [139]:
sample_df

Unnamed: 0,WBAN_Number,Call_Sign,Call_Sign2,Year,Month,Day,Hour,Minute,Mystery_Prefix,Date,...,Zulu_Time,Report_Modifier,Wind_Data,Wind_Direction,Wind_Speed,Gusts,Gust_Speed,Variable_Winds,Variable_Wind_Info,Num_Fields
0,63870,K0J4,0J4,2019,08,01,00,00,111,08/01/19,...,010500Z,AUTO,True,000,00,False,,False,,19
1,63870,K0J4,0J4,2019,08,01,00,05,111,08/01/19,...,010505Z,AUTO,True,000,00,False,,False,,19
2,63870,K0J4,0J4,2019,08,01,00,10,111,08/01/19,...,010510Z,AUTO,True,000,00,False,,False,,19
3,63870,K0J4,0J4,2019,08,01,00,15,111,08/01/19,...,010515Z,AUTO,True,000,00,False,,False,,19
4,63870,K0J4,0J4,2019,08,01,00,20,111,08/01/19,...,010520Z,AUTO,True,000,00,False,,False,,19
5,63870,K0J4,0J4,2019,08,01,00,25,111,08/01/19,...,010525Z,AUTO,True,000,00,False,,False,,19
6,63870,K0J4,0J4,2019,08,01,00,30,111,08/01/19,...,010530Z,AUTO,True,000,00,False,,False,,19
7,63870,K0J4,0J4,2019,08,01,00,35,111,08/01/19,...,010535Z,AUTO,True,000,00,False,,False,,19
8,63870,K0J4,0J4,2019,08,01,00,40,111,08/01/19,...,010540Z,AUTO,True,000,00,False,,False,,19
9,63870,K0J4,0J4,2019,08,01,00,45,111,08/01/19,...,010545Z,AUTO,True,000,00,False,,False,,19


In [96]:
gusty = sample_df[sample_df['Gusts'] == True]
print("There are",len(gusty),"records with gust data.")

There are 59 records with gust data.


In [140]:
variable = sample_df[sample_df['Variable_Winds'] == True]
variable

Unnamed: 0,WBAN_Number,Call_Sign,Call_Sign2,Year,Month,Day,Hour,Minute,Mystery_Prefix,Date,...,Zulu_Time,Report_Modifier,Wind_Data,Wind_Direction,Wind_Speed,Gusts,Gust_Speed,Variable_Winds,Variable_Wind_Info,Num_Fields
75,63870,K0J4,0J4,2019,08,02,02,50,110,08/02/19,...,020750Z,AUTO,True,VRB,03,False,,True,,20
76,63870,K0J4,0J4,2019,08,02,02,55,117,08/02/19,...,020755Z,AUTO,True,VRB,03,False,,True,,21
126,63870,K0J4,0J4,2019,08,02,07,05,113,08/02/19,...,021205Z,AUTO,True,VRB,03,False,,True,,20
131,63870,K0J4,0J4,2019,08,02,07,30,113,08/02/19,...,021230Z,AUTO,True,VRB,03,False,,True,,20
135,63870,K0J4,0J4,2019,08,02,07,50,113,08/02/19,...,021250Z,AUTO,True,VRB,03,False,,True,,20
137,63870,K0J4,0J4,2019,08,02,08,00,110,08/02/19,...,021300Z,AUTO,True,VRB,03,False,,True,,20
142,63870,K0J4,0J4,2019,08,02,08,25,127,08/02/19,...,021325Z,AUTO,True,VRB,03,False,,True,,22
153,63870,K0J4,0J4,2019,08,02,09,20,110,08/02/19,...,021420Z,AUTO,True,VRB,03,False,,True,,20
155,63870,K0J4,0J4,2019,08,02,09,30,110,08/02/19,...,021430Z,AUTO,True,VRB,03,False,,True,,20
159,63870,K0J4,0J4,2019,08,02,09,50,110,08/02/19,...,021450Z,AUTO,True,VRB,03,False,,True,,20


In [134]:
nowinddata = sample_df[sample_df['Wind_Data'] == False]
missing_wind = 0
for num in list(nowinddata.index):
    print(lines[num])
    missing_wind += 1

63870K0J4 0J420190802153012608/02/19 15:30:31  5-MIN K0J4 022030Z AUTO 10SM TS FEW034 SCT043 BKN050 28/20 A2998 260 60 1900 M /M RMK AO2 TSB02 T02830200 $
63870K0J4 0J420190802154012608/02/19 15:40:31  5-MIN K0J4 022040Z AUTO 10SM TS SCT034 BKN043 BKN050 27/19 A2998 260 60 1800 M /M RMK AO2 TSB02 T02720189 $
63870K0J4 0J420190804134509808/04/19 13:45:31  5-MIN K0J4 041845Z AUTO 10SM CLR 32/20 A2998 260 48 2300 M /M RMK AO2 T03220200
63870K0J4 0J420190805113511508/05/19 11:35:31  5-MIN K0J4 051635Z AUTO 10SM FEW026 FEW032 BKN100 29/22 A3003 210 64 1900 M /M RMK AO2 T02890217
63870K0J4 0J420190805125513908/05/19 12:55:31  5-MIN K0J4 051755Z AUTO 10SM FEW029 SCT050 31/22 A3000 240 56 2200 M /M RMK AO2 SLP155 60000 T03110217 10317 20233 50005
63870K0J4 0J420190805131510108/05/19 13:15:31  5-MIN K0J4 051815Z AUTO 10SM FEW030 31/21 A3000 250 56 2100 M /M RMK AO2 T03060211
63870K0J4 0J420190805134010808/05/19 13:40:31  5-MIN K0J4 051840Z AUTO 10SM FEW039 FEW090 31/21 A2998 260 54 2100 M /M RM

In [137]:
print("There are",missing_wind,"records out of", len(lines), "without wind data.")

There are 44 records out of 6154 without wind data.


In [125]:
short = sample_df[sample_df['Num_Fields'] < 19]

In [64]:
short

Unnamed: 0,WBAN_Number,Call_Sign,Call_Sign2,Year,Month,Day,Hour,Minute,Date,Timestamp,Interval,Call_Sign3,Num_Fields
1400,63870,K0J4,0J4,2019,8,7,0,25,07508/07/19,00:25:31,5-MIN,K0J4,13
1569,63870,K0J4,0J4,2019,8,8,6,50,10108/08/19,06:50:31,5-MIN,K0J4,18
2363,63870,K0J4,0J4,2019,8,11,9,5,08008/11/19,09:05:31,5-MIN,K0J4,17
2364,63870,K0J4,0J4,2019,8,11,9,15,09908/11/19,09:15:31,5-MIN,K0J4,17
2365,63870,K0J4,0J4,2019,8,11,9,20,10408/11/19,09:20:31,5-MIN,K0J4,18
2366,63870,K0J4,0J4,2019,8,11,9,25,10408/11/19,09:25:31,5-MIN,K0J4,18
2367,63870,K0J4,0J4,2019,8,11,9,30,10408/11/19,09:30:31,5-MIN,K0J4,18
2368,63870,K0J4,0J4,2019,8,11,9,35,10408/11/19,09:35:31,5-MIN,K0J4,18
4235,63870,K0J4,0J4,2019,8,21,8,35,04908/21/19,08:35:31,5-MIN,K0J4,8
4906,63870,K0J4,0J4,2019,8,24,2,55,10108/24/19,02:55:31,5-MIN,K0J4,18


### Merge lat long data for stations
Takes a few minutes to download the station file

In [156]:
filepath = "ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.txt"
stations = [] # an array of each read line
for station in pd.read_csv(filepath_or_buffer=filepath , encoding='utf-8', chunksize=1):
    stations.append(station.iloc[0,0])

In [159]:
meta_data = stations[0:16]
station_cols = stations[16]
meta_data

[' USAF = Air Force station ID. May contain a letter in the first position.',
 ' WBAN = NCDC WBAN number',
 ' CTRY = FIPS country ID',
 '   ST = State for US stations',
 ' ICAO = ICAO ID',
 '  LAT = Latitude in thousandths of decimal degrees',
 '  LON = Longitude in thousandths of decimal degrees',
 ' ELEV = Elevation in meters',
 'BEGIN = Beginning Period Of Record (YYYYMMDD). There may be reporting gaps within the P.O.R.',
 '  END = Ending Period Of Record (YYYYMMDD). There may be reporting gaps within the P.O.R.',
 'Notes:',
 '- Missing station name',
 '- The term "bogus" indicates that the station name',
 '- For a small % of the station entries in this list',
 '  available. To determine data availability for each location',
 "  'isd-inventory.txt' or 'isd-inventory.csv' file. "]

In [201]:
station_cols = station_cols.split()

In [202]:
station_cols

['USAF',
 'WBAN',
 'STATION',
 'NAME',
 'CTRY',
 'ST',
 'CALL',
 'LAT',
 'LON',
 'ELEV(M)',
 'BEGIN',
 'END']

In [208]:
station_cols = ['USAF','WBAN_Number','Descriptor', 'Lat','Lon','Elev(m)','Begin_Date','End_date']

In [161]:
stations = stations[17:] # remove header (meta data and column names)

In [209]:
station_data = [] # an array of arrays, inner arrays are all data for one record, outer array is all records
for station in stations:
    data_start = 0 # position after awk location data in each record
    USAF = station[0:6]
    WBAN = station[7:12]
    data_start = station.find('+')
    location_string = station[12:data_start]
    #station_code = re.match(r'\w\w\w\w', location_string[-6:])
    #if station_code:
    #    station_code = station_code.group(0)
    rest_of_data = station[data_start:].split()    
    station_record = [USAF, WBAN, location_string] + rest_of_data
    station_data.append(station_record)
station_df = pd.DataFrame(station_data, columns = station_cols)

In [210]:
station_df

Unnamed: 0,USAF,WBAN_Number,Descriptor,Lat,Lon,Elev(m),Begin_Date,End_date
0,007018,99999,WXPOD 7018,+00.000,+000.000,+7018.0,20110309,20130730
1,007026,99999,WXPOD 7026 AF,+00.000,+000.000,+7026.0,20120713,20170822
2,007070,99999,WXPOD 7070 AF,+00.000,+000.000,+7070.0,20140923,20150926
3,008260,99999,WXPOD8270,+00.000,+000.000,+0000.0,20050101,20100731
4,008268,99999,WXPOD8278 AF,+32.950,+065.567,+1156.7,20100519,20120323
5,008307,99999,WXPOD 8318 AF,+00.000,+000.000,+8318.0,20100421,20100421
6,008411,99999,XM20 ...,7,,,,
7,008414,99999,XM18 ...,7,,,,
8,008415,99999,XM21 ...,7,,,,
9,008418,99999,XM24 ...,7,,,,


In [211]:
# get rid of some useless data
station_df = station_df.drop("USAF", axis=1)

In [212]:
merged_df = pd.merge(sample_df, station_df, on='WBAN_Number')

In [213]:
merged_df.head()

Unnamed: 0,WBAN_Number,Call_Sign,Call_Sign2,Year,Month,Day,Hour,Minute,Mystery_Prefix,Date,...,Gust_Speed,Variable_Winds,Variable_Wind_Info,Num_Fields,Descriptor,Lat,Lon,Elev(m),Begin_Date,End_date
0,63870,K0J4,0J4,2019,8,1,0,0,111,08/01/19,...,,False,,19,FLORALA MUNI US AL K0J4,31.043,-86.312,95.7,20060511,20190906
1,63870,K0J4,0J4,2019,8,1,0,5,111,08/01/19,...,,False,,19,FLORALA MUNI US AL K0J4,31.043,-86.312,95.7,20060511,20190906
2,63870,K0J4,0J4,2019,8,1,0,10,111,08/01/19,...,,False,,19,FLORALA MUNI US AL K0J4,31.043,-86.312,95.7,20060511,20190906
3,63870,K0J4,0J4,2019,8,1,0,15,111,08/01/19,...,,False,,19,FLORALA MUNI US AL K0J4,31.043,-86.312,95.7,20060511,20190906
4,63870,K0J4,0J4,2019,8,1,0,20,111,08/01/19,...,,False,,19,FLORALA MUNI US AL K0J4,31.043,-86.312,95.7,20060511,20190906


### Plot where we have wind data

In [222]:
# create lat and long list from the dataframe
# stores output in an html file that you then open
# the google map will be shaded unless you've set up API key (see below)

latitude_list = []
longitude_list = []
for row in range(len(station_df)):
    try:
        lat = station_df.loc[row]['Lat']
        lon = station_df.loc[row]['Lon']
        if (len(lat) < 2) or (len(lon) < 2): 
            pass
        else:
            latitude_list.append(float(lat))
            longitude_list.append(float(lon))
    except:
        pass
    
# scatter points on a google map
gmap3 = gmplot.GoogleMapPlotter(latitude_list[0], longitude_list[0], 13)
gmap3.scatter(latitude_list, longitude_list, '# FF0000', 
                              size = 40, marker = False ) 
  
gmap3.draw("data/wind_map.html")

In [223]:
# Need to fix with an API key.
# https://console.cloud.google.com/google/maps-apis/new?project=composite-drive-193303&folder&organizationId
# how to get an API key:
# https://developers.google.com/maps/documentation/geocoding/get-api-key

gmap3 = gmplot.GoogleMapPlotter(latitude_list[0], longitude_list[0], 13)
gmap3.scatter(latitude_list, longitude_list, '# FF0000', 
                              size = 40, marker = False ) 
gmap3.apikey = "AIzaSyA2TdrwntJVu6IuS_3fOY7WLTLvhl3xntk" # this is Ben's key. replace with your own.
gmap3.draw("data/wind_map.html") 

### Conclusions

* We'll have to figure out how best to match the once-every-5-minute data from ASOS with the variable interval PurpleAir readings. 
* There's great global coverage of wind data, but any given region in the U.S. will only have a few stations. I think that's okay -- PM2.5 can be very localized, but wind isn't all that different in the same region. But that is an assumption to check. 