# Initial data cleaning - Year 2017

[Columns explanation](https://aqs.epa.gov/aqsweb/airdata/FileFormats.html#_hourly_data_files)

[Data source](https://aqs.epa.gov/aqsweb/airdata/download_files.html#Raw)

### Goals:
* identify unimportant columns
* identify the best way to shorten the data set: find the most densely populated states, counties, sites
* create unique identifier for each metering site *SiteCode*
* save to parquet

### TODO:
* save WIND to parquet
* do the same for TEMP, PRESS, RH_DP and PM2.5 FRM/FEM Mass, PM2.5 non FRM/FEM Mass, PM10 Mass, PM2.5 Speciation, PM10 Speciation 

In [68]:
from os.path import join
from glob import glob

import pandas as pd  # needed for the pd.set_option calls below
from dask import dataframe as dd
from matplotlib import rcParams


from deep_aqi import ROOT


# full option names avoid ambiguous matches in newer pandas versions
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 25)

In [2]:
RAW_DATA = join(ROOT, 'data', 'raw')
INTERIM_DATA = join(ROOT, 'data', 'interim')

In [3]:
# recursive=True only matters with a '**' pattern, so it is omitted here
files = glob(f'{RAW_DATA}/*.csv')
files

['/home/filip/projects/deep_aqi/deep_aqi/data/raw/hourly_WIND_2017.csv',
 '/home/filip/projects/deep_aqi/deep_aqi/data/raw/hourly_PRESS_2017.csv',
 '/home/filip/projects/deep_aqi/deep_aqi/data/raw/hourly_88101_2017.csv',
 '/home/filip/projects/deep_aqi/deep_aqi/data/raw/hourly_TEMP_2017.csv']

### WIND

In [4]:
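# 'Qualifier' is mostly empty, which made dtype inference inconsistent; read it as object
# select the WIND file by name, since glob does not guarantee any ordering
wind_file = next(f for f in files if 'WIND' in f)
data = dd.read_csv(wind_file, dtype={'Qualifier': 'object'})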
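
A likely cause of the inconsistency is that dask infers column types from a sample at the start of the file: if every sampled Qualifier value is empty, the column looks numeric, and a later text value no longer fits. The sketch below uses plain pandas on a made-up CSV snippet (the value `V` and the column subset are assumptions for illustration) to show that forcing `object` keeps both the empty cells and the text value.

```python
import io

import pandas as pd

# Hypothetical snippet: Qualifier is empty for the first rows, then holds a text flag
csv_text = "Sample Measurement,Qualifier\n1.0,\n2.0,\n3.0,V\n"

# Forcing 'object' means empty cells become NaN and text flags stay strings
df = pd.read_csv(io.StringIO(csv_text), dtype={'Qualifier': 'object'})

print(df['Qualifier'].dtype)
print(df['Qualifier'].isna().sum())
```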

In [31]:
rcParams['figure.figsize'] = [20, 10]

data['State Name'].value_counts().compute().plot(kind='bar')

In [64]:
# Site Num is unique only within county, SiteCode is unique across all counties and states
data['SiteCode'] = data['State Name'] + '_' + data['County Name'] + '_' + data['Site Num'].astype(str)
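
To illustrate why the concatenated key is needed, here is a minimal sketch in plain pandas on made-up rows (the toy values are assumptions, not taken from the EPA file). Three sites share the same Site Num but sit in different counties or states.

```python
import pandas as pd

# Toy rows (hypothetical values): the same Site Num appears in three different counties
toy = pd.DataFrame({
    'State Name': ['Ohio', 'Ohio', 'Texas'],
    'County Name': ['Franklin', 'Belmont', 'Rusk'],
    'Site Num': [1, 1, 1],
})

# Same construction as in the cell above: state, county and site number joined by '_'
toy['SiteCode'] = (toy['State Name'] + '_' + toy['County Name'] + '_'
                   + toy['Site Num'].astype(str))

print(toy['Site Num'].nunique())   # Site Num alone cannot tell the sites apart
print(toy['SiteCode'].nunique())   # SiteCode can
```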

## Dropping columns
[Columns explanation](https://aqs.epa.gov/aqsweb/airdata/FileFormats.html#_hourly_data_files)

Columns dropped:
* State Code, County Code - same information as in State Name, County Name
* Site Num - now part of SiteCode (State Name and County Name are also part of SiteCode, but are still kept; see keep_cols)
* Parameter Code - same information as in Parameter Name
* Datum - seemingly unimportant
* MDL - seemingly unimportant
* Uncertainty - seemingly unimportant (mostly empty column)
* Qualifier - seemingly unimportant (mostly empty column)
* Method Type - seemingly unimportant
* Method Code - seemingly unimportant
* Date of Last Change - unimportant

In [65]:
keep_cols = ['SiteCode', 'State Name', 'County Name', 'Latitude', 'Longitude', 'Date Local',
             'Time Local', 'Date GMT', 'Time GMT', 'Parameter Name', 'POC', 'Sample Measurement',
             'Units of Measure']
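
Applying the list could look like the sketch below. It uses plain pandas on a one-row toy frame with made-up values and a shortened, hypothetical version of the list. The same column selection syntax should work on a dask dataframe. Checking for missing names first gives a clearer error than a failed selection.

```python
import pandas as pd

# Shortened, hypothetical version of keep_cols for the toy frame
keep_cols = ['SiteCode', 'Latitude', 'Sample Measurement']

# Toy row (made-up values); 'MDL' stands in for the columns to be dropped
toy = pd.DataFrame({
    'SiteCode': ['Ohio_Franklin_38'],
    'Latitude': [40.0],
    'MDL': [0.1],
    'Sample Measurement': [3.2],
})

# Fail early with a readable message if a requested column is absent
missing = [c for c in keep_cols if c not in toy.columns]
if missing:
    raise KeyError(f'columns not found: {missing}')

trimmed = toy[keep_cols]
print(list(trimmed.columns))
```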

### Explanation for number of occurrences per site

There are 8760 hours in a non-leap year such as 2017 (365 × 24):
* sites with n ~ 17520 (2 × 8760) have measurements of both wind speed and wind direction
* sites with n > 17520 have measurements made by multiple instruments - take the average value from all instruments for each hour
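
The arithmetic and the per-hour averaging can be sketched as follows, in plain pandas on made-up rows. The parameter names and values are assumptions for illustration, and treating POC as the column that distinguishes instruments at one site is also an assumption. Note that a plain mean is only a rough choice for wind direction in degrees, where a circular mean would be more appropriate.

```python
import pandas as pd

HOURS_2017 = 365 * 24             # 2017 is not a leap year
FULL_WIND_SITE = 2 * HOURS_2017   # one row per hour for speed plus one for direction

# Toy rows (hypothetical): one site, one hour, wind speed reported by two instruments
toy = pd.DataFrame({
    'SiteCode': ['A_B_1', 'A_B_1', 'A_B_1'],
    'Parameter Name': ['Wind Speed - Resultant', 'Wind Speed - Resultant',
                       'Wind Direction - Resultant'],
    'Date GMT': ['2017-01-01'] * 3,
    'Time GMT': ['00:00'] * 3,
    'POC': [1, 2, 1],
    'Sample Measurement': [4.0, 6.0, 180.0],
})

# Collapse multiple instruments into one value per site, parameter and hour
hourly = (toy.groupby(['SiteCode', 'Parameter Name', 'Date GMT', 'Time GMT'])
             ['Sample Measurement']
             .mean()
             .reset_index())

print(HOURS_2017, FULL_WIND_SITE)
print(hourly)
```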

In [69]:
site_summary = data.groupby(by=['SiteCode'])['Sample Measurement'].agg('count').compute().sort_values(ascending=False)
site_summary

SiteCode
Missouri_Jefferson_9008       38528
Texas_Brewster_101            34620
Indiana_Porter_11             29940
Michigan_Oakland_11           17520
Ohio_Franklin_38              17520
Michigan_Washtenaw_8          17520
Ohio_Belmont_6                17520
Michigan_Muskegon_39          17520
California_Riverside_12       17520
Michigan_Wayne_33             17520
Michigan_Macomb_21            17520
Indiana_Vigo_1009             17518
                              ...  
Washington_Grant_1003          4348
Connecticut_Middlesex_9007     4331
California_San Diego_1001      4308
New Hampshire_Grafton_3002     4270
Oregon_Multnomah_2008          4092
California_El Dorado_12        3970
Texas_Freestone_1084           2968
California_Kern_18             2928
Texas_Rusk_1082                2900
Maryland_Baltimore_1007        2841
Virginia_Henrico_14            1606
New Hampshire_Belknap_2006     1488
Name: Sample Measurement, Length: 770, dtype: int64
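
One possible way to use these counts for shortening the data set is sketched below, on a few counts copied from the output above. The 0.9 coverage cutoff and the variable names are hypothetical, not a choice made in this notebook.

```python
import pandas as pd

EXPECTED = 2 * 365 * 24  # 17520 rows for a full year of wind speed and direction

# A few counts copied from the summary above
counts = pd.Series({
    'Missouri_Jefferson_9008': 38528,
    'Michigan_Oakland_11': 17520,
    'Indiana_Vigo_1009': 17518,
    'New Hampshire_Belknap_2006': 1488,
})

coverage = counts / EXPECTED

# More rows than expected suggests several instruments at one site
multi_instrument = counts[counts > EXPECTED].index.tolist()

# Hypothetical cutoff: keep sites with at least 90% of the expected rows
well_covered = counts[coverage >= 0.9].index.tolist()

print(multi_instrument)
print(well_covered)
```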