# Surface Data Preparation (part 1)


  
## **In this notebook, raw data from the Copper Mountain, CO SNOTEL site and the Copper Mountain, CO AWOS station are organized, combined, and preprocessed into csv files for future use**



**Step 0: Import necessary modules**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
from glob import glob
import datetime as dt
import warnings

warnings.filterwarnings('ignore')   #suppress warning messages for cleaner presentation


### A.) Data Cleaning of Copper Mountain SNOTEL Data

----

**Step A1: Hourly data files were downloaded on an annual basis, so there is one file for each year (2005-2017).  The data are comma separated and contain headers.  Each SNOTEL data file row contains hourly or 3-hourly observations of the following parameters:**  
**Field 1. Site_ID:**  NRCS Site Identifier  
**Field 2. Date:**  Date of observation  
**Field 3. Time:** Hour of observation  
**Field 4. WTEQ.I-1 (in):** Recorded Water Equivalent of snow  
**Field 5. PREC.I-1 (in):** Recorded Precipitation  
**Field 6. TOBS.I-1 (degC):** Recorded Temperature  
**Field 7. SNWD.I-1 (in):** Recorded Snow Depth  

**Here, the data files are located using the glob function and each one is read in.  The date and time columns are merged together and made into the index for the dataframe.**

In [2]:

snotel_files = glob(r'C:\Users\RAPP\Documents\Capstone\data\SNOTEL\415_STAND_YEAR=*.csv')
#print(snotel_files)
snotel_data = [pd.read_csv(f, header=1, parse_dates=[['Date', 'Time']], index_col='Date_Time') for f in snotel_files]

snotel_df= pd.concat(snotel_data)
#print(snotel_df.head())
#print(snotel_df.describe())
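
The read-and-concatenate pattern in Step A1 can be tried without the original files. The sketch below is only an illustration: the two in-memory strings, the `TOBS` column name, and the single preamble line (which is what `header=1` would skip) are assumptions, not the real SNOTEL layout. It also builds the timestamp index with `pd.to_datetime` on the joined strings, which is one alternative to passing a nested list to `parse_dates`.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two yearly files; real paths would come from glob().
# The first line imitates a preamble row that header=1 skips.
fake_2005 = ("preamble line\n"
             "Site_ID,Date,Time,TOBS\n"
             "415,2005-12-31,22:00,-5.1\n"
             "415,2005-12-31,23:00,-5.6\n")
fake_2006 = ("preamble line\n"
             "Site_ID,Date,Time,TOBS\n"
             "415,2006-01-01,00:00,-6.0\n")

frames = []
for text in (fake_2005, fake_2006):
    df = pd.read_csv(io.StringIO(text), header=1)
    # Join the Date and Time strings and convert them to one timestamp index.
    df.index = pd.to_datetime(df["Date"] + " " + df["Time"])
    df.index.name = "Date_Time"
    frames.append(df.drop(columns=["Date", "Time"]))

# Stack the yearly frames and make sure the index is in time order.
combined = pd.concat(frames).sort_index()
print(combined)
```

Sorting after `pd.concat` matters here because `glob` does not guarantee any particular file order, and later steps assume a chronological index.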




**Step A2:  The missing-data code for all fields is -99.9.  Set these values to NaN.**

In [3]:
#set missing values (coded as -99.9) to NaN
xx = (snotel_df == -99.9)
snotel_df[xx] = np.nan

#print(snotel_df.describe())
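
As a small illustration of Step A2, the following sketch shows two equivalent ways to turn a sentinel value into NaN. The frame `demo` and its column names are made up for the example; only the -99.9 code comes from the step above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with -99.9 marking missing observations.
demo = pd.DataFrame({"temp": [-4.2, -99.9, -3.8],
                     "depth": [30.0, 31.0, -99.9]})

# Option 1: mask every cell where the comparison is True (masked cells become NaN).
masked = demo.mask(demo == -99.9)

# Option 2: replace the sentinel directly.
replaced = demo.replace(-99.9, np.nan)

print(replaced)
```

Both calls return new dataframes and leave `demo` unchanged, which makes it easy to compare counts before and after with `describe()`.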


**Step A3:  The dataset should have a value for every hour, so use asfreq to make sure there is a row for each hour.  Any hour missing from the raw data is filled with NaN.**

In [4]:
snotel_df = snotel_df.asfreq(freq='1H', fill_value=np.nan)
#print(snotel_df.describe())
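
To see what `asfreq` does in Step A3, consider a toy series with a two-hour gap. This is a minimal sketch with invented values. It passes `pd.offsets.Hour(1)` rather than a string alias because the accepted spelling of the hourly alias differs between pandas versions.

```python
import pandas as pd

# Hypothetical observations with 02:00 and 03:00 missing.
idx = pd.to_datetime(["2017-01-01 00:00", "2017-01-01 01:00", "2017-01-01 04:00"])
sparse = pd.Series([1.0, 2.0, 5.0], index=idx)

# asfreq reindexes onto a regular hourly grid between the first and last timestamp;
# hours that were absent appear as NaN rows.
hourly = sparse.sort_index().asfreq(pd.offsets.Hour(1))
print(hourly)
```

Note that `fill_value` in `asfreq` only applies to rows created by the reindexing; NaN values already present in the data are left as they are.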

**Step A4:  As temperature and snow depth are the only variables of future interest, save the temperature and snow depth columns of this dataframe as a comma-delimited file, ready for further analysis.**

In [5]:
snotel_df.to_csv('snotel_df.dat',sep = ',', float_format = '%.2f',columns=['TOBS.I-1 (degC) ', 'SNWD.I-1 (in) '])
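
The effect of the `columns` and `float_format` arguments can be checked with a quick round trip. The sketch below writes to an in-memory buffer instead of a file, and the frame and column names (`temp_c`, `depth_in`, `unused`) are hypothetical.

```python
import io
import pandas as pd

idx = pd.date_range("2017-01-01", periods=3, freq=pd.offsets.Hour(1), name="Date_Time")
demo = pd.DataFrame(
    {"temp_c": [-4.234, -4.0, -3.81],
     "depth_in": [30.0, 30.5, 31.0],
     "unused": [1, 2, 3]},
    index=idx,
)

buffer = io.StringIO()  # stands in for a file on disk
# Only the listed columns are written, with floats rounded to two decimals.
demo.to_csv(buffer, sep=",", float_format="%.2f", columns=["temp_c", "depth_in"])

# Read the text back, restoring the timestamp index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Date_Time", parse_dates=True)
print(restored)
```

Because `float_format` rounds on output, the saved file keeps two decimals only; the in-memory dataframe retains full precision.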

### B.) Data Cleaning of Copper Mountain, CO ASOS Data 

----

**Step B1: The raw ASOS data was obtained in Integrated Surface Hourly Lite format. A single file with hourly values was downloaded for each year (2006-2017).  According to the documentation, this is a fixed-width format delimited by whitespace, and the data has 12 columns:**

**Field 1: Pos 1-4, Length 4: Observation Year  
Field 2: Pos 6-7, Length 2: Observation Month  
Field 3: Pos 9-11, Length 2: Observation Day  
Field 4: Pos 12-13, Length 2: Observation Hour  
Field 5: Pos 14-19, Length 6:  Air Temperature, Units: deg C scaled by factor of 10  
Field 6: Pos 20-24, Length 6: Dew Point Temperature, Units: deg C scaled by factor of 10  
Field 7: Pos 26-31, Length 6: Sea Level Pressure, Units: hectoPascals scaled by factor of 10  
Field 8: Pos 32-37, Length 6: Wind Direction, Units: angular degrees  
Field 9: Pos 38-43, Length 6: Wind Speed Rate, Units: m/s scaled by factor of 10  
Field 10: Pos 44-49, Length 6: Sky Condition Total Coverage Code  
Field 11: Pos 50-55, Length 6: Liquid Precipitation Depth Dimension - One Hour Duration, Units: mm scaled by factor of 10  
Field 12: Pos 56-61, Length 6: Liquid Precipitation Depth Dimension - Six Hour Duration, Units: mm scaled by factor of 10  **
  
**Below, the data files are read in using pd.read_csv. Names appropriate for each column are specified in header_names. In addition, the dataframe index is created from the first four date/time columns.**


In [6]:

awos_files = glob(r'C:\Users\RAPP\Documents\Capstone\data\ASOS\722061-03038\722061-03038*\722061-03038*')
print(awos_files)

header_names = ('Year', 'Month', 'Day', 'Hour', 'Temperature', 'Dewpoint', 'Pressure', 'WindDirection', 'WindSpeed', 'CloudCover', '1hr_Precipitation', '6hr_Precipitation')

awos_data = [pd.read_csv(f, delim_whitespace=True, header=None, names=header_names,
                         parse_dates={'Date_Time': ['Year', 'Month', 'Day', 'Hour']},
                         index_col='Date_Time')
             for f in awos_files]
raw_awos_df= pd.concat(awos_data)

#print(raw_awos_df.head())


['C:\\Users\\RAPP\\Documents\\Capstone\\data\\ASOS\\722061-03038\\722061-03038-2006\\722061-03038-2006', 'C:\\Users\\RAPP\\Documents\\Capstone\\data\\ASOS\\722061-03038\\722061-03038-2007\\722061-03038-2007', 'C:\\Users\\RAPP\\Documents\\Capstone\\data\\ASOS\\722061-03038\\722061-03038-2008\\722061-03038-2008', 'C:\\Users\\RAPP\\Documents\\Capstone\\data\\ASOS\\722061-03038\\722061-03038-2009\\722061-03038-2009', 'C:\\Users\\RAPP\\Documents\\Capstone\\data\\ASOS\\722061-03038\\722061-03038-2010\\722061-03038-2010', 'C:\\Users\\RAPP\\Documents\\Capstone\\data\\ASOS\\722061-03038\\722061-03038-2011\\722061-03038-2011', 'C:\\Users\\RAPP\\Documents\\Capstone\\data\\ASOS\\722061-03038\\722061-03038-2012\\722061-03038-2012', 'C:\\Users\\RAPP\\Documents\\Capstone\\data\\ASOS\\722061-03038\\722061-03038-2013\\722061-03038-2013', 'C:\\Users\\RAPP\\Documents\\Capstone\\data\\ASOS\\722061-03038\\722061-03038-2014\\722061-03038-2014', 'C:\\Users\\RAPP\\Documents\\Capstone\\data\\ASOS\\722061-03038
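
The parsing in Step B1 can be illustrated on a couple of invented rows. This is a sketch under stated assumptions: the two sample lines are made up to match the 12-column layout listed above, and the timestamp index is assembled with `pd.to_datetime` on the year, month, day, and hour columns, which is one alternative to the `parse_dates` dictionary used in the cell.

```python
import io
import pandas as pd

# Two hypothetical rows in the 12-column whitespace-delimited layout.
sample = (
    "2006 01 01 00   -50  -120 -9999   270    40     0 -9999 -9999\n"
    "2006 01 01 01   -60  -130 -9999   280    50     2 -9999 -9999\n"
)
names = ["Year", "Month", "Day", "Hour", "Temperature", "Dewpoint", "Pressure",
         "WindDirection", "WindSpeed", "CloudCover",
         "1hr_Precipitation", "6hr_Precipitation"]

df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=names)

# to_datetime can assemble timestamps from columns named year/month/day/hour.
parts = df[["Year", "Month", "Day", "Hour"]].rename(columns=str.lower)
df.index = pd.DatetimeIndex(pd.to_datetime(parts), name="Date_Time")
df = df.drop(columns=["Year", "Month", "Day", "Hour"])
print(df)
```

At this stage the values are still raw: temperatures are in tenths of a degree and -9999 still marks missing data.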

**Step B2. Make a copy of the raw dataframe and call it awos_df.  Most of the raw data is in non-standard units (e.g. deg C scaled by 10) and will be converted to more standard units.  The column names in the copied dataframe are therefore updated to reflect the new units.  All missing-data filling will be performed in the new dataframe.**

In [7]:

awos_df = raw_awos_df[:].copy()
#print(awos_df.keys())
#print(awos_df.index)
awos_df.rename(columns={'Temperature': 'Temperature_degC', 'Dewpoint': 'Dewpoint_degC', 'Pressure': 'Pressure_hp' , 'WindSpeed': 'WindSpeed_m/s', 'WindDirection': 'WindDirection_deg', '1hr_Precipitation': '1hr_Precipitation_mm', '6hr_Precipitation': '6hr_Precipitation_mm'}, inplace=True)
#print(awos_df.keys())


**Step B3. The data should contain an observation for each hour. The function asfreq will be used to ensure that is the case.  Any missing hours will be filled with np.nan.**

In [8]:
awos_df = awos_df.asfreq(freq='1H', fill_value=np.nan)
#print(awos_df.describe())

**Step B4.  The data documentation states that all missing values are set to -9999.  These will be changed to np.nan in the awos_df dataframe.**

In [9]:
#set missing values (coded as -9999) to NaN
xx = (awos_df == -9999.0)
awos_df[xx] = np.nan


#print(awos_df.describe())
#print(raw_awos_df.describe())

**Step B5: According to documentation, many of the variables' data have been scaled by a factor of 10.  To get actual values in more standard units for some variables, the raw data needs to be divided by 10.  
This operation will be saved in the copied dataframe just made.**

In [10]:
awos_df['Temperature_degC'] = awos_df['Temperature_degC']/10
awos_df['Dewpoint_degC'] = awos_df['Dewpoint_degC']/10
awos_df['Pressure_hp'] = awos_df['Pressure_hp']/10
awos_df['WindSpeed_m/s'] = awos_df['WindSpeed_m/s']/10
awos_df['1hr_Precipitation_mm'] = awos_df['1hr_Precipitation_mm']/10
awos_df['6hr_Precipitation_mm'] = awos_df['6hr_Precipitation_mm']/10


print(awos_df.describe())


       Temperature_degC  Dewpoint_degC  Pressure_hp  WindDirection_deg  \
count      82634.000000   82491.000000          0.0       76327.000000   
mean           1.610124      -8.923046          NaN         230.179360   
std            9.293641       8.115253          NaN          73.606778   
min          -28.000000     -45.000000          NaN           0.000000   
25%           -5.000000     -14.000000          NaN         190.000000   
50%            1.000000      -9.000000          NaN         260.000000   
75%            9.000000      -3.000000          NaN         280.000000   
max           24.000000      10.000000          NaN         360.000000   

       WindSpeed_m/s    CloudCover  1hr_Precipitation_mm  6hr_Precipitation_mm  
count   76327.000000  65100.000000           1152.000000                   0.0  
mean        6.047255      3.023717              0.955903                   NaN  
std         3.430585      3.548211              1.232150                   NaN  
min      
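
The order of Steps B4 and B5 matters, and a toy example makes the reason visible. The sketch below uses invented values and only two columns; which columns need scaling is taken from the field list in Step B1.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values: temperature in tenths of deg C, direction in degrees.
raw = pd.DataFrame({
    "Temperature": [-50, -9999, 120],
    "WindDirection": [270, 280, -9999],
})

# First turn the -9999 sentinel into NaN ...
clean = raw.replace(-9999, np.nan)

# ... then undo the factor-of-10 scaling, only for the columns that carry it.
scaled_cols = ["Temperature"]  # WindDirection is already in angular degrees
clean[scaled_cols] = clean[scaled_cols] / 10

print(clean)
```

If the division were done first, the sentinel would become -999.9 in the scaled columns and a later comparison against -9999 would no longer find it.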

**Step B6. Now write the final dataframe to a comma-delimited file for future use.**

In [11]:
awos_df.to_csv('awos_df.dat',sep = ',', float_format = '%.2f')