### Goal:

Automate data cleaning for remaining four years of buoy data

In [1]:
import gzip
import os
import io
from pathlib import Path
import pandas as pd
import numpy as np

In [2]:
#decompress a gzip file to a text file, then clean the data and write it to csv
#gz = gzip file path (str or Path)
#txt = intermediate text file path (str or Path)
#csv = output csv file path (str or Path)

def clean_buoy(gz, txt, csv):

    #unzipping: read the gzip stream as UTF-8 text and write it out
    with gzip.open(gz, 'rt', encoding='utf-8') as ip:
        content = ip.read()
    with open(txt, 'w', encoding='utf-8') as f:
        f.write(content)

    #reading as whitespace delimited file (raw string avoids an invalid escape warning)
    df = pd.read_csv(txt, sep=r'\s+', header=None)

    #keep the first 8 columns, then drop the units row (index 1) and column 6
    df = df.iloc[:, 0:8].drop(index=1, columns=6)
    #promote the first row to column headers
    df.columns = df.iloc[0]
    df = df.drop([0], axis=0).rename(columns={"Hs": "Height", "Dp": "Deg"}).reset_index(drop=True)

    #getting rid of 30 minute data bc only need hour granularity
    df = df.iloc[::2].drop(columns=['MN'])

    #creating a csv
    df.to_csv(csv, index=False)

In [3]:
root_folder = Path.cwd().parents[1]


In [4]:
clean_buoy(root_folder/'data/raw/cdip2015.gz', root_folder/'data/interim/swell2015.txt', root_folder/'data/interim/00-swell2015.csv')
clean_buoy(root_folder/'data/raw/cdip2016.gz', root_folder/'data/interim/swell2016.txt', root_folder/'data/interim/00-swell2016.csv')
clean_buoy(root_folder/'data/raw/cdip2017.gz', root_folder/'data/interim/swell2017.txt', root_folder/'data/interim/00-swell2017.csv')
clean_buoy(root_folder/'data/raw/cdip2018.gz', root_folder/'data/interim/swell2018.txt', root_folder/'data/interim/00-swell2018.csv')
clean_buoy(root_folder/'data/raw/cdip2019.gz', root_folder/'data/interim/swell2019.txt', root_folder/'data/interim/00-swell2019.csv')
clean_buoy(root_folder/'data/raw/cdip2020.gz', root_folder/'data/interim/swell2020.txt', root_folder/'data/interim/00-swell2020.csv')
clean_buoy(root_folder/'data/raw/cdip2021.gz', root_folder/'data/interim/swell2021.txt', root_folder/'data/interim/00-swell2021.csv')

Success!
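
The unzip step in `clean_buoy` writes an intermediate text file before reading it back. As a possible shortcut, pandas can usually read a `.gz` file directly. The sketch below is only an illustration: `read_gz_table` and the tiny demo file are made-up names and data, not part of the original pipeline.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd


def read_gz_table(gz_path):
    """Read a whitespace-delimited .gz file straight into a DataFrame."""
    # compression="infer" (the pandas default) picks gzip from the ".gz" suffix
    return pd.read_csv(gz_path, sep=r"\s+", header=None, compression="infer")


# Tiny made-up file so the sketch runs on its own (hypothetical contents)
tmp_dir = Path(tempfile.mkdtemp())
demo_gz = tmp_dir / "demo.gz"
with gzip.open(demo_gz, "wt", encoding="utf-8") as fh:
    fh.write("YEAR MO DY\n2015 1 1\n2015 1 2\n")

demo = read_gz_table(demo_gz)
print(demo.shape)
```

With something like this, the seven calls could probably be replaced by a loop over `range(2015, 2022)` that builds each path from the year.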

The cleaning section involves:
- getting rid of unnecessary columns; kept swell direction and swell height
- dropping the row of units, since I can remember that Height is in **meters** and Deg is in **degrees**
- promoting the row that should have been the column headers
- renaming columns
- keeping only half of the data, because only hour-level information is needed. It's not high-stakes enough to do an average or something over the hour, and since swell change is gradual, it (probably) doesn't change a significant amount within a half-hour range. Thus, kept only data entries from the top of the hour, then eliminated the minutes column
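
The steps above can be sketched on a small in-memory sample. The sample values and the `Tp` column name are made up for illustration; only `YEAR`, `MO`, `DY`, `HR`, `MN`, `Hs` and `Dp` come from the code above. The sample deliberately skips the 00:30 record to show one thing to watch for: taking every other row assumes no half-hourly record is missing, whereas filtering on the minutes column does not.

```python
import io

import pandas as pd

# Hypothetical sample: header row, units row, then records.
# The 00:30 record is intentionally missing.
sample = """\
YEAR MO DY HR MN Hs Tp Dp
UTC UTC UTC UTC UTC m s deg
2015 1 1 0 0 1.20 12.5 270
2015 1 1 1 0 1.25 13.3 272
2015 1 1 1 30 1.24 13.3 270
2015 1 1 2 0 1.30 13.3 271
"""

raw = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

# Drop the units row (index 1) and the unused column 6, then promote row 0 to headers
df = raw.iloc[:, 0:8].drop(index=1, columns=6)
df.columns = df.iloc[0]
df = (
    df.drop(index=0)
    .rename(columns={"Hs": "Height", "Dp": "Deg"})
    .reset_index(drop=True)
)

# Approach used in clean_buoy: every other row
every_other = df.iloc[::2]

# Alternative: keep rows whose minutes value is 0
on_the_hour = df[df["MN"].astype(int) == 0]

print(every_other[["HR", "MN"]])
print(on_the_hour[["HR", "MN"]])
```

On this sample the every-other-row version lands on the 01:30 record after the gap, while the minutes filter keeps 00:00, 01:00 and 02:00. Whether that matters for the real files depends on how many half-hourly records are actually missing.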

### Creating a UTC datetime column

In [5]:
#df is the dataframe to clean
#csv is the file path name, a string

def dtSwell(df, csv):
    
    #making columns in question strings
    df[['YEAR','MO','DY','HR']] = df[['YEAR','MO','DY','HR']].astype(str)
    
    #month formatting
    for i in range(len(df)):
        if len(df['MO'][i])<2:
            df['MO'][i]='0'+df['MO'][i]

    #day formatting
    for i in range(len(df)):
        if len(df['DY'][i])<2:
            df['DY'][i]='0'+df['DY'][i]

    #hour formatting
    for i in range(len(df)):
        if len(df['HR'][i])<2:
            df['HR'][i]='0'+df['HR'][i]
    
    #making a string that the to_datetime function will recognize
    df['UTC'] = df['YEAR']+df['MO']+df['DY']+df['HR']+'00'
    
    #converting 
    df['UTC'] = pd.to_datetime(df['UTC'], utc=True)
    
    #dropping now useless columns and saving to csvs
    df.drop(columns=['YEAR','MO','DY','HR']).to_csv(csv, index=False)
    
    return

In [6]:
s15 = pd.read_csv(root_folder/'data/interim/00-swell2015.csv')
s16 = pd.read_csv(root_folder/'data/interim/00-swell2016.csv')
s17 = pd.read_csv(root_folder/'data/interim/00-swell2017.csv')
s18 = pd.read_csv(root_folder/'data/interim/00-swell2018.csv')
s19 = pd.read_csv(root_folder/'data/interim/00-swell2019.csv')
s20 = pd.read_csv(root_folder/'data/interim/00-swell2020.csv')
s21 = pd.read_csv(root_folder/'data/interim/00-swell2021.csv')

In [7]:
dtSwell(s15, root_folder/'data/interim/01-swell2015.csv')
dtSwell(s16, root_folder/'data/interim/01-swell2016.csv')
dtSwell(s17, root_folder/'data/interim/01-swell2017.csv')
dtSwell(s18, root_folder/'data/interim/01-swell2018.csv')
dtSwell(s19, root_folder/'data/interim/01-swell2019.csv')
dtSwell(s20, root_folder/'data/interim/01-swell2020.csv')
dtSwell(s21, root_folder/'data/interim/01-swell2021.csv')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['MO'][i]='0'+df['MO'][i]
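
The warning above is triggered by chained assignment such as `df['MO'][i] = ...`, which may write to a temporary copy rather than to the dataframe. A minimal sketch of one way to avoid it, assuming the same `YEAR`, `MO`, `DY`, `HR` columns: zero-pad whole columns at once with `.str.zfill(2)`. The helper name `build_utc` and the demo values are hypothetical.

```python
import pandas as pd


def build_utc(df):
    """Return a UTC datetime Series built from YEAR, MO, DY and HR columns."""
    parts = df[["YEAR", "MO", "DY", "HR"]].astype(str)
    # Zero-pad month, day and hour to two characters, column by column
    stamp = (
        parts["YEAR"]
        + parts["MO"].str.zfill(2)
        + parts["DY"].str.zfill(2)
        + parts["HR"].str.zfill(2)
        + "00"
    )
    # An explicit format keeps the parsing unambiguous
    return pd.to_datetime(stamp, format="%Y%m%d%H%M", utc=True)


# Made-up rows for illustration
demo = pd.DataFrame({"YEAR": [2015, 2015], "MO": [1, 12], "DY": [2, 31], "HR": [3, 23]})
demo["UTC"] = build_utc(demo)
print(demo["UTC"])
```

Assigning the result to a whole column, as in `demo["UTC"] = ...`, avoids the element-by-element writes that raise the warning.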

## New Goal:
append all years together for one dataset

### csv files cannot store datetime objects, so the UTC column needs to be converted back to datetime while reading each file into a dataframe.

In [10]:
date_parser = pd.to_datetime

s15 = pd.read_csv(root_folder/'data/interim/01-swell2015.csv', parse_dates=['UTC'], date_parser=date_parser)
s16 = pd.read_csv(root_folder/'data/interim/01-swell2016.csv', parse_dates=['UTC'], date_parser=date_parser)
s17 = pd.read_csv(root_folder/'data/interim/01-swell2017.csv', parse_dates=['UTC'], date_parser=date_parser)
s18 = pd.read_csv(root_folder/'data/interim/01-swell2018.csv', parse_dates=['UTC'], date_parser=date_parser)
s19 = pd.read_csv(root_folder/'data/interim/01-swell2019.csv', parse_dates=['UTC'], date_parser=date_parser)
s20 = pd.read_csv(root_folder/'data/interim/01-swell2020.csv', parse_dates=['UTC'], date_parser=date_parser)
s21 = pd.read_csv(root_folder/'data/interim/01-swell2021.csv', parse_dates=['UTC'], date_parser=date_parser)
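
To illustrate the point about csv files, here is a small round trip on made-up data: the datetime column is written out as text, comes back as plain strings by default, and is restored by `parse_dates`. The sketch uses `parse_dates` alone, since the `date_parser` argument is deprecated as of pandas 2.0; the column values are hypothetical.

```python
import io

import pandas as pd

# Made-up frame with a timezone-aware UTC column
original = pd.DataFrame(
    {
        "Height": [1.2, 1.3],
        "Deg": [270, 272],
        "UTC": pd.to_datetime(["2015-01-01 00:00", "2015-01-01 01:00"], utc=True),
    }
)

# Write to an in-memory csv instead of a file on disk
buf = io.StringIO()
original.to_csv(buf, index=False)

# Without parse_dates the column is read back as text
buf.seek(0)
plain = pd.read_csv(buf)

# With parse_dates the column is converted to datetimes again
buf.seek(0)
parsed = pd.read_csv(buf, parse_dates=["UTC"])

print(plain["UTC"].dtype)
print(parsed["UTC"].dtype)
```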

In [11]:
bigs = pd.concat([s15,s16,s17,s18,s19,s20,s21], ignore_index=True)
bigs.shape

(59102, 3)

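definitely missing some hours (59,102 rows against the full-coverage count below) but should be alright

In [12]:
#7 years of hourly data, plus two leap days (2016 and 2020)
24*365*7+24*2

61368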
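
To see which hours are missing rather than only how many, one option is to compare the timestamps against a complete hourly range. This is a sketch under assumptions: `missing_hours` is a hypothetical helper, and the demo timestamps are made up.

```python
import pandas as pd


def missing_hours(utc, start, end):
    """Return the hourly UTC timestamps between start and end that are absent from utc."""
    # Complete hourly grid, both endpoints included
    full = pd.date_range(start=start, end=end, freq=pd.Timedelta(hours=1), tz="UTC")
    return full.difference(pd.DatetimeIndex(utc))


# Made-up example: the 02:00 record is absent
present = pd.to_datetime(
    ["2015-01-01 00:00", "2015-01-01 01:00", "2015-01-01 03:00"], utc=True
)
gaps = missing_hours(present, "2015-01-01 00:00", "2015-01-01 03:00")
print(gaps)
```

Applied to the combined dataframe, the same call with the first and last expected hours would list every gap, which could help decide whether the missing data is clustered or scattered.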

In [13]:
#checking to make sure there aren't duplicate dates:
#every timestamp should appear exactly once
bigs['UTC'].is_unique

True

In [14]:
bigs.to_csv(root_folder/'data/interim/00-swell.csv',index=False)