This notebook develops the data transformations and formatting needed to prepare the data for modeling. The code is intended to be functional and easy to copy into new notebooks for submission or other use.

## Package imports:

In [1]:
import gresearch_crypto

from datetime import datetime
import pandas as pd

## Declare variables:

In [2]:
train_filepath = '/kaggle/input/g-research-crypto-forecasting/train.csv'
asset_details_filepath = '/kaggle/input/g-research-crypto-forecasting/asset_details.csv'

## Import data:

In [3]:
train_df = pd.read_csv(train_filepath)
asset_details_df = pd.read_csv(asset_details_filepath)

In [4]:
train_df.tail()

In [5]:
asset_details_df.tail()

## Define functions for data cleaning / transforms:

In [6]:
def clean_dates(df):
    '''
    Function to regularize the timestamps of an individual coin
    between the earliest and latest timestamps observed for that coin.
    Also fills in missing values by linear interpolation.
    
    Inputs: 
        df (pd.DataFrame) :
            Rows for a single coin (one group of a groupby on "Asset_ID").
            All timestamps must be in intervals of 60 seconds.

    Outputs:
        df (pd.DataFrame) :
            Dataframe indexed by timestamp at 60-second intervals,
            with interior missing values filled.
    '''
    
    df = df.copy()
    
    # range() excludes its stop value, so add one step to keep the latest timestamp
    dates = range(min(df["timestamp"]), max(df["timestamp"]) + 60, 60)
    
    df.set_index("timestamp", inplace = True)
    
    # rows created for missing timestamps are filled with NaN
    df = df.reindex(dates)
    
    # only fill gaps that have observations on both sides; missing values at the
    # end of the dataset have no ending observation to interpolate with
    df.interpolate(method = "linear", inplace = True, limit_area = "inside")
    
    return df
    
def standardize_data(df):
    '''
    Function to standardize data by creating rows for every timestamp
    and subsetting to only consider when all coins had their first observation made.
    
    Inputs:
        df (pd.DataFrame) :
            Time series data to be standardized
            
    Outputs:
        standard_df (pd.DataFrame) :
            Time series data now standardized
    '''
    
    # deep copy to not alter the original
    df = df.copy()
    
    # fill missing rows / values between coin's start and stop date
    # Note: Missing rows beyond an individual coin's start / stop date are not created,
    # only those between are filled in
    standard_df = df.groupby("Asset_ID").apply(clean_dates).reset_index(level = 0, drop = True)
    
    # reset twice so timestamp is only a column and not also index, 
    # this makes each entry have a unique index
    standard_df = standard_df.reset_index()
    
    # get the earliest timestamp for each coin, then get the latest timestamp out of those.
    # this shows when the latest coin was introduced, after which there are observations for
    # all coins
    first_timestamp = max(standard_df.groupby("Asset_ID")["timestamp"].min())
    
    # subset to only consider the time period where observations existed for all coins
    standard_df = standard_df.loc[standard_df["timestamp"] >= first_timestamp]
    
    # drop the remaining rows with missing values (those that could not be interpolated);
    # reassign instead of modifying the slice in place
    standard_df = standard_df.dropna()
    
    return standard_df
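
To make the reindex-and-interpolate step concrete, here is a minimal sketch on hypothetical toy data for a single coin (the values and the `toy`, `grid`, and `filled` names are illustrative assumptions, not part of the competition data). The grid in this sketch includes the final timestamp, and `limit_area="inside"` only fills gaps that have observations on both sides.

```python
import pandas as pd

# Hypothetical toy data for one coin: the 120 s and 180 s rows are missing.
toy = pd.DataFrame({
    "timestamp": [60, 240, 300],
    "Close": [1.0, 4.0, 5.0],
})

# Full 60-second grid from the first to the last observed timestamp (inclusive).
grid = range(int(toy["timestamp"].min()), int(toy["timestamp"].max()) + 60, 60)

# Reindex onto the grid (new rows are NaN), then fill interior gaps linearly.
filled = (
    toy.set_index("timestamp")
       .reindex(grid)
       .interpolate(method="linear", limit_area="inside")
)
print(filled)
```

The two missing rows are filled with values on the straight line between the neighboring observations.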

## Show problems with dataset that need cleaning:

- Many missing target values

In [7]:
train_df.isna().sum()

- Different starting dates for each coin

In [8]:
train_df.groupby("Asset_ID")["timestamp"].min()

- All coins do have the same ending dates

In [9]:
train_df.groupby("Asset_ID")["timestamp"].max()

- Related to the above, coins have different numbers of observations

In [10]:
train_df.groupby("Asset_ID").size()

- Coins have different periods between observations

In [11]:
train_df.groupby("Asset_ID")["timestamp"].diff().value_counts().head()
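
The gap sizes can also be turned into a count of missing rows per coin. The following is a rough sketch on hypothetical toy data (the frame and the `missing_rows` name are assumptions for illustration); it relies on the 60-second spacing described in this notebook, so a gap of `g` seconds hides `g / 60 - 1` rows.

```python
import pandas as pd

# Hypothetical toy data: coin 0 skips the 180 s and 240 s rows, coin 1 has no gaps.
toy = pd.DataFrame({
    "Asset_ID":  [0, 0, 0, 1, 1, 1],
    "timestamp": [60, 120, 300, 60, 120, 180],
})

sorted_toy = toy.sort_values(["Asset_ID", "timestamp"])

# Seconds between consecutive observations of the same coin
# (NaN for the first row of each coin).
gaps = sorted_toy.groupby("Asset_ID")["timestamp"].diff()

# Convert each gap to a number of missing rows and total them per coin.
missing_rows = (
    gaps.div(60).sub(1)
        .groupby(sorted_toy["Asset_ID"])
        .sum()
        .astype(int)
)
print(missing_rows)
```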

## Apply cleaning functions to standardize data:

In [12]:
result = standardize_data(train_df)

In [13]:
result

Note that early timestamps are missing for all coins because of the subset.

In [14]:
# query an early timestamp (2018-01-01 00:01:00 UTC); after the subset this should return no rows
result.loc[result["timestamp"] == 1514764860]
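
Raw timestamps are easier to read as dates. A small sketch, assuming the `timestamp` column holds seconds since the Unix epoch in UTC (the `readable` name is illustrative):

```python
import pandas as pd

# Interpret the integer as seconds since 1970-01-01 00:00:00 UTC.
readable = pd.to_datetime(1514764860, unit="s")
print(readable)  # 2018-01-01 00:01:00
```

The same call works on a whole column, for example `pd.to_datetime(result["timestamp"], unit="s")`.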

Also note that all coins now have the same number of rows.

In [15]:
result.groupby("Asset_ID").size()

Each coin's entries are now also evenly spaced.

In [16]:
result.groupby("Asset_ID")["timestamp"].diff().value_counts().head()
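
Rather than reading the value counts by eye, the spacing can be checked programmatically. This is a minimal sketch with a hypothetical helper, `is_evenly_spaced`, shown on toy data; it is an illustration and not part of the cleaning functions above.

```python
import pandas as pd

def is_evenly_spaced(df, step=60):
    """Return True if every coin's consecutive timestamps differ by exactly `step` seconds."""
    diffs = (
        df.sort_values(["Asset_ID", "timestamp"])
          .groupby("Asset_ID")["timestamp"]
          .diff()
          .dropna()  # the first row of each coin has no previous row
    )
    return bool((diffs == step).all())

# Hypothetical toy data: one evenly spaced frame, one with a 120-second gap.
toy_ok = pd.DataFrame({"Asset_ID": [0, 0, 1, 1], "timestamp": [60, 120, 60, 120]})
toy_bad = pd.DataFrame({"Asset_ID": [0, 0, 1, 1], "timestamp": [60, 180, 60, 120]})

print(is_evenly_spaced(toy_ok), is_evenly_spaced(toy_bad))
```

In this notebook the helper could be called as `is_evenly_spaced(result)`.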

## Make train / test split

For modeling, the train / test split should mirror the conditions of the evaluation period so that model selection is meaningful. The evaluation period is three months long, so the final three months of the training dataset are held out as a test set for evaluating models.

In [17]:
latest_timestamp = result.loc[:, "timestamp"].max()

# length of the test period in seconds:
# 60 seconds per minute, 60 minutes per hour, 24 hours per day, ~30 days per month,
# for 3 months
test_period_seconds = 60*60*24*30*3

# rows older than the test period go to train; the final ~3 months go to test
standard_train_df = result.loc[(latest_timestamp - result["timestamp"]) > test_period_seconds]
standard_test_df = result.loc[(latest_timestamp - result["timestamp"]) <= test_period_seconds]
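
The same split can be wrapped in a reusable function. The sketch below uses a hypothetical helper, `split_last_period`, and toy data; it assumes the time column holds timestamps in seconds and applies the same boundary rule as the cell above (rows within the final period, boundary included, go to the test set). For reference, 60 × 60 × 24 × 30 × 3 = 7,776,000 seconds.

```python
import pandas as pd

def split_last_period(df, period_seconds, time_col="timestamp"):
    """Split df into (train, test), where test covers the final `period_seconds` of data."""
    boundary = df[time_col].max() - period_seconds
    train = df.loc[df[time_col] < boundary]
    test = df.loc[df[time_col] >= boundary]
    return train, test

# Hypothetical toy data: ten one-minute rows (60 s to 600 s); hold out the last 180 s.
toy = pd.DataFrame({"timestamp": range(60, 660, 60)})
toy_train, toy_test = split_last_period(toy, period_seconds=180)
print(len(toy_train), len(toy_test))
```

Because the boundary row is included in the test set, a 180-second test period on one-minute data contains four rows rather than three.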

In [18]:
standard_train_df

In [19]:
standard_test_df