### Header

In this notebook, we will clean the weather dataset. 

In [None]:
# import libraries

# maths
import numpy as np
import pandas as pd
import time
import datetime

In [None]:
# file paths

input_path = '../data/2_input/'
clean_path = '../data/3_clean/'
output_path = '../data/4_output/'

image_path = '../images/'

### Functions

As with the train and test datasets, we split the `date` feature into `year`, `month` and `day` columns. Based on the first few rows of the data, some values appear to be missing and are labelled as `M`, `-` or `T`. Hence, we also created functions to summarise these 'null' values.

In [None]:
# split dates

def create_yr(x): 
    return x.split('-')[0] 

def create_mth(x): 
    return x.split('-')[1] 

def create_day(x): 
    return x.split('-')[2] 

def rename_columns(columns):
    return [column.lower() for column in columns]

def clean_date(df): 
    df.columns = rename_columns(df.columns)
    df['year'] = df.date.apply(create_yr)
    df['month'] = df.date.apply(create_mth)
    df['day'] = df.date.apply(create_day)
    df.drop('date', axis=1, inplace = True)
    return df
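
The row-wise `apply` calls above can also be expressed with pandas' vectorised string methods. The following is a minimal sketch on a toy frame, assuming the dates are `YYYY-MM-DD` strings as in the functions above; `split_date` and `toy` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

def split_date(df):
    """Return a copy of df with 'date' replaced by 'year', 'month' and 'day' string columns."""
    out = df.copy()
    # expand=True returns one column per piece: 0 = year, 1 = month, 2 = day
    parts = out['date'].str.split('-', expand=True)
    out['year'] = parts[0]
    out['month'] = parts[1]
    out['day'] = parts[2]
    return out.drop(columns='date')

# toy data for illustration only
toy = pd.DataFrame({'date': ['2007-05-01', '2007-05-02'], 'tmax': [83, 59]})
toy_split = split_date(toy)
print(toy_split)
```

Unlike `clean_date`, this sketch leaves the input frame untouched and returns a new one, which makes it safe to re-run the cell.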

In [None]:
def count_t(x): 
    if x == '  T':
        return 1
    else:
        return 0
    
def count_m(x): 
    if x == 'M':
        return 1
    else:
        return 0

def count_dash(x): 
    if x == '-':
        return 1
    else:
        return 0

# count total number of M, - and T in each object column of df

def print_summary(df):    

    cols = ['column','M','-','T']
    df_summary = pd.DataFrame(columns=cols)
    idx = 0

    for col in df.columns:

        if df[col].dtype == 'object':

            total_m = df[col].apply(count_m).sum()
            total_dash = df[col].apply(count_dash).sum()
            total_t = df[col].apply(count_t).sum()

            df_summary.at[idx,cols[0]] = col
            df_summary.at[idx,cols[1]] = total_m
            df_summary.at[idx,cols[2]] = total_dash
            df_summary.at[idx,cols[3]] = total_t

        idx += 1
    
    return df_summary
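
For comparison, the same per-column counts can be obtained without the helper functions by comparing whole columns against each marker. This is a sketch under the assumption that the markers are exactly `'M'`, `'-'` and `'  T'` (with two leading spaces, as in `count_t`); `summarise_markers` and `toy` are hypothetical names.

```python
import pandas as pd

def summarise_markers(df, markers=('M', '-', '  T')):
    """Count how often each marker string occurs in every non-numeric column."""
    text_cols = df.select_dtypes(exclude='number')
    # eq() compares every cell with the marker; sum() counts the True values per column
    counts = {marker.strip(): text_cols.eq(marker).sum() for marker in markers}
    return pd.DataFrame(counts).rename_axis('column').reset_index()

# toy data for illustration only
toy = pd.DataFrame({
    'tavg': ['67', 'M', '70'],
    'preciptotal': ['  T', '0.10', 'M'],
    'sunrise': ['0448', '-', '0447'],
    'tmax': [83, 84, 59],
})
summary = summarise_markers(toy)
print(summary)
```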

### Import Data

In [None]:
# import weather data

df = pd.read_csv(input_path + 'weather.csv')

### Inspect Data

As indicated below, the weather dataset has 2944 rows and 22 features. Several columns are read in with the `object` datatype because they contain string placeholders for missing values. We will need to rectify those later.

In [None]:
# print first 5 records

print(df.shape)
df.head()

In [None]:
# list all columns

print(df.columns)

In [None]:
# df summary

df.describe()

In [None]:
df = clean_date(df)

In [None]:
# check columns type

df.dtypes

As indicated in the summary below, half of the entries (1472 out of 2944) in the `depart`, `sunrise` and `sunset` columns are missing. For `sunrise` and `sunset`, our desktop research indicates that this is because station 2 does not collect data for these columns. We have therefore decided to fill these missing values, along with those in `depart`, with the values from station 1.

All entries in `water1` are the missing value `M`, so we drop this column. We also drop the `snowfall` and `depth` columns, as their entries are either 0 or missing.

There are `T` values in `snowfall` and `preciptotal`. According to the data documentation, `T` means that only a trace amount of precipitation was recorded for that entry, so we set these values to 0. We impute the missing values in `preciptotal` with the column median, and do the same for `avgspeed`, `sealevel` and `stnpressure`.

In [None]:
# count total number of M, -,  T in df

print('before cleaning:')
df_summary = print_summary(df)
df_summary

In [None]:
df.snowfall.unique()

In [None]:
df.depth.unique()

In [None]:
df.drop(columns = ['codesum','water1','snowfall','depth'], inplace = True)

In [None]:
df.head()

In [None]:
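# copy station 1 values into the station 2 row that follows it
# (assumes rows alternate station 1, station 2 for every date)
depart_idx = df.columns.get_loc('depart')
sunrise_idx = df.columns.get_loc('sunrise')
sunset_idx = df.columns.get_loc('sunset')

# stop at shape[0] - 1 so that i+1 never runs past the last row
for i in range(0, df.shape[0] - 1, 2):
    df.iloc[i+1, depart_idx] = df.iloc[i, depart_idx]
    df.iloc[i+1, sunrise_idx] = df.iloc[i, sunrise_idx]
    df.iloc[i+1, sunset_idx] = df.iloc[i, sunset_idx]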
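
The same station 1 to station 2 copy can be sketched without an explicit loop by blanking the station 2 values and forward-filling. This assumes, as the loop does, that every station 2 row directly follows the station 1 row for the same date; `fill_from_station_1` and `toy` are hypothetical names.

```python
import numpy as np
import pandas as pd

def fill_from_station_1(df, cols=('depart', 'sunrise', 'sunset')):
    """Return a copy of df where non-station-1 rows take the preceding station 1 values in cols."""
    out = df.copy()
    cols = list(cols)
    # blank out everything that is not a station 1 reading ...
    out.loc[out['station'] != 1, cols] = np.nan
    # ... then carry the last station 1 value forward onto the following row
    out[cols] = out[cols].ffill()
    return out

# toy data for illustration only
toy = pd.DataFrame({
    'station': [1, 2, 1, 2],
    'depart': ['14', 'M', '-3', 'M'],
    'sunrise': ['0448', '-', '0447', '-'],
    'sunset': ['1849', '-', '1850', '-'],
})
filled = fill_from_station_1(toy)
print(filled)
```

Selecting columns by name rather than by position keeps the step valid if the column order changes, for example after a different set of columns is dropped.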

In [None]:
def impute_missing_tavg(row):
    if row['tavg'] == 'M': 
        row['tavg'] = (row['tmax'] - row['tmin']) * 0.5 + row['tmin']
    return row

df = df.apply(impute_missing_tavg, axis = 1)
df.tavg = df.tavg.astype('int64')

In [None]:
def impute_missing_wetbulb(row): 
    if row['wetbulb'] == 'M':
        # approximate wetbulb as tavg minus one third of the gap between tavg and dewpoint
        row['wetbulb'] = row['tavg']-((row['tavg']-row['dewpoint'])/3)
    return row

df = df.apply(impute_missing_wetbulb, axis = 1)

In [None]:
# compute each median once; pd.to_numeric(errors='coerce') turns 'M' and '  T' into NaN,
# which median() skips
median_cols = ['preciptotal', 'stnpressure', 'sealevel', 'avgspeed']
medians = {col: pd.to_numeric(df[col], errors='coerce').median() for col in median_cols}

def impute_missing_rest(row): 

    # derive heat and cool from tavg relative to 65
    if row['heat'] == 'M':
        if row['tavg'] >= 65: 
            row['heat'] = 0
            row['cool'] = row['tavg'] - 65
        else: 
            row['heat'] = 65 - row['tavg']
            row['cool'] = 0

    # trace precipitation is treated as 0
    if row['preciptotal'] == '  T':
        row['preciptotal'] = 0

    # remaining missing values take the column median
    for col in median_cols:
        if row[col] == 'M':
            row[col] = medians[col]
    return row

df = df.apply(impute_missing_rest, axis = 1)
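
After imputation, columns that were read as `object` can still hold a mix of numeric strings and numbers. One possible way to rectify the datatypes is `pd.to_numeric`, sketched here on a toy frame; the choice of columns in `numeric_cols` is an assumption for illustration, not the notebook's own list.

```python
import pandas as pd

# toy data for illustration only: numeric strings mixed with imputed numbers
toy = pd.DataFrame({
    'preciptotal': ['0.10', 0, 0.05],
    'stnpressure': ['29.10', '29.38', 29.2],
})

numeric_cols = ['preciptotal', 'stnpressure']  # hypothetical selection
# to_numeric parses the strings and raises if a non-numeric marker such as 'M' is still present
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric)
print(toy.dtypes)
```

Because `pd.to_numeric` raises on unparseable strings by default, this conversion also acts as a check that no placeholder was missed.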

### Clean Data

We checked and confirmed that there are no more missing values before saving our processed dataset. 

In [None]:
print('after cleaning:')    
df_summary = print_summary(df)
df_summary
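
The visual check of the summary table can be backed by a programmatic one. The sketch below assumes the same three marker strings as before; `remaining_markers` and the toy frames are hypothetical names.

```python
import pandas as pd

def remaining_markers(df, markers=('M', '-', '  T')):
    """Return, for each column that still contains a marker, the number of affected cells."""
    counts = df.isin(list(markers)).sum()
    return counts[counts > 0]

# toy data for illustration only
toy_clean = pd.DataFrame({'tavg': [67, 68], 'preciptotal': [0.0, 0.1]})
toy_dirty = pd.DataFrame({'tavg': [67, 'M'], 'preciptotal': ['  T', 0.1]})

# an empty result means no placeholder strings are left
assert remaining_markers(toy_clean).empty
print(remaining_markers(toy_dirty))
```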

In [None]:
df.head()

### Output Data

In [None]:
df.head()

In [None]:
# output clean data

df.to_csv(clean_path + 'weather_clean.csv',index=False)