This notebook examines the data files supplied in the competition and adds context to each. It includes the original description of each file, along with examples that better illustrate its contents and purpose.

Import packages:

In [None]:
import gresearch_crypto

import pandas as pd

Declare variables:

In [None]:
asset_details_filepath = '/kaggle/input/g-research-crypto-forecasting/asset_details.csv'
ex_sample_submission_filepath = '/kaggle/input/g-research-crypto-forecasting/example_sample_submission.csv'
ex_test_filepath = '/kaggle/input/g-research-crypto-forecasting/example_test.csv'
supplemental_train_filepath = '/kaggle/input/g-research-crypto-forecasting/supplemental_train.csv'
train_filepath = '/kaggle/input/g-research-crypto-forecasting/train.csv'

Import data:

In [None]:
asset_details_df = pd.read_csv(asset_details_filepath)
ex_sample_submission_df = pd.read_csv(ex_sample_submission_filepath)
ex_test_df = pd.read_csv(ex_test_filepath)
supplemental_train_df = pd.read_csv(supplemental_train_filepath)
train_df = pd.read_csv(train_filepath)

#### Asset Details Explanation:

This file is straightforward. The competition describes it as:  
`Provides the real name and of the cryptoasset for each Asset_ID and the weight each cryptoasset receives in the metric.`

Weights are used for calculation of the evaluation metric, where certain coins are more important for the metric than others. See the [tutorial](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition) for a detailed calculation.

In [None]:
asset_details_df.tail()
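
To make the role of the weights more concrete, here is a minimal sketch of a weighted Pearson correlation. It assumes the metric is a weighted correlation in which each row carries the weight of its asset; the linked tutorial remains the reference for the exact calculation. The function name `weighted_correlation` and all the numbers are illustrative, not taken from the competition files.

```python
import numpy as np


def weighted_correlation(y_true, y_pred, weights):
    """Weighted Pearson correlation between two 1-D arrays (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so the weights sum to 1

    mean_true = np.sum(w * y_true)
    mean_pred = np.sum(w * y_pred)
    cov = np.sum(w * (y_true - mean_true) * (y_pred - mean_pred))
    var_true = np.sum(w * (y_true - mean_true) ** 2)
    var_pred = np.sum(w * (y_pred - mean_pred) ** 2)
    return cov / np.sqrt(var_true * var_pred)


# Toy values only: one row per asset, with made-up targets, predictions and weights.
targets = np.array([0.01, -0.02, 0.03, 0.00])
preds = np.array([0.02, -0.01, 0.02, 0.01])
row_weights = np.array([4.0, 1.0, 6.0, 2.0])

score = weighted_correlation(targets, preds, row_weights)
print(score)
```

With equal weights this reduces to the ordinary Pearson correlation, so heavier weights simply make errors on those coins count for more.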

#### Example Test and Sample Submission Explanation

These two files share the same description:  
`An example of the data that will be delivered by the time series API.`  
The example sample submission has the additional description:  
`The data is just copied from train.csv.`

Also shown in the above tutorial, these files are examples of what is used during the official submission process. Submission code includes the following parts:

`iter_test = env.iter_test()`

`# get the data for the first test batch`  
`# this line is indented in a loop in the full code`  
`(test_df, sample_prediction_df) = next(iter_test)`

In this example, `ex_test_df` corresponds to `test_df` and `ex_sample_submission_df` corresponds to `sample_prediction_df`. The test dataframes hold new observations to make predictions on, while the prediction dataframes store the predictions once the model has made them. In fact, the `predict` function is described as:  
`Stores your predictions for the current batch. Expects the same format as sample_prediction_df.`  
with the given example:  
`env.predict(sample_prediction_df)`
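
The loop these pieces belong to can be sketched offline as follows. Because `gresearch_crypto` is only available inside the competition environment, this sketch uses `MockEnv`, a hypothetical stand-in written for this notebook (it is not the real API), and toy batches whose column names (`row_id`, `Target`, `Close`) are assumptions based on the example files.

```python
import pandas as pd


class MockEnv:
    """Hypothetical stand-in for the competition environment (not the real API)."""

    def __init__(self, test_batches):
        self._batches = test_batches
        self.submitted = []  # predictions received so far

    def iter_test(self):
        for test_df in self._batches:
            # Each batch is paired with a prediction template to fill in.
            sample_prediction_df = pd.DataFrame(
                {"row_id": test_df["row_id"].to_numpy(), "Target": 0.0}
            )
            yield test_df, sample_prediction_df

    def predict(self, prediction_df):
        self.submitted.append(prediction_df.copy())


# Two toy batches, one per timestamp (made-up values).
batches = [
    pd.DataFrame({"timestamp": [60, 60], "Asset_ID": [0, 1],
                  "Close": [10.0, 20.0], "row_id": [0, 1]}),
    pd.DataFrame({"timestamp": [120, 120], "Asset_ID": [0, 1],
                  "Close": [11.0, 19.0], "row_id": [2, 3]}),
]

env = MockEnv(batches)
iter_test = env.iter_test()

for test_df, sample_prediction_df in iter_test:
    # Placeholder "model": predict zero for every row in the batch.
    sample_prediction_df["Target"] = 0.0
    env.predict(sample_prediction_df)

print(len(env.submitted), "batches submitted")
```

In a real submission, the placeholder line would be replaced by a model's predictions for the rows in `test_df`.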

In [None]:
ex_test_df.tail()

In [None]:
ex_sample_submission_df.tail()

One point to check when making actual submissions: in the tutorial, neither the test nor the prediction dataframes have a group number column, yet both provided files have one, as shown above. The test data also has a separate Asset ID column, so the group number does not identify the coin. It indicates something else that may or may not be important.
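
One way to investigate what the group number tracks is to summarise each group. The sketch below runs on a toy frame with assumed column names (`timestamp`, `Asset_ID`, `group_num`) and tests one hypothesis, which the competition description does not confirm: that each group corresponds to a single timestamp. The same `groupby` could be applied to `ex_test_df`.

```python
import pandas as pd

# Toy frame with assumed column names and made-up values.
toy_test_df = pd.DataFrame({
    "timestamp": [60, 60, 60, 120, 120, 120],
    "Asset_ID": [0, 1, 2, 0, 1, 2],
    "group_num": [0, 0, 0, 1, 1, 1],
})

# Count rows, distinct assets and distinct timestamps in each group.
summary = toy_test_df.groupby("group_num").agg(
    n_rows=("Asset_ID", "size"),
    n_assets=("Asset_ID", "nunique"),
    n_timestamps=("timestamp", "nunique"),
)

# True only if every group covers exactly one timestamp.
one_timestamp_per_group = bool((summary["n_timestamps"] == 1).all())

print(summary)
print("One timestamp per group:", one_timestamp_per_group)
```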

#### Train and Supplemental Train Explanation

The training file is described succinctly as `The training set`, while the supplemental training file is given a much longer description:  
`After the submission period is over this file's data will be replaced with cryptoasset prices from the submission period. In the Evaluation phase, the train, train supplement, and test set will be contiguous in time, apart from any missing data. The current copy, which is just filled approximately the right amount of data from train.csv is provided as a placeholder.`



In [None]:
train_df.tail()

In [None]:
supplemental_train_df.tail()
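
Since the description says the sets will be contiguous in time "apart from any missing data", it may be worth checking for gaps. The sketch below assumes timestamps are in seconds with one row per asset per minute, so `expected_step=60`; that spacing, the helper name `find_time_gaps`, and the toy values are all assumptions made for illustration.

```python
import pandas as pd


def find_time_gaps(df, expected_step=60):
    """Return rows that follow a gap larger than expected_step, per asset."""
    ordered = df.sort_values(["Asset_ID", "timestamp"])
    # Difference to the previous timestamp of the same asset (NaN for the first row).
    step = ordered.groupby("Asset_ID")["timestamp"].diff()
    is_gap = step > expected_step
    gaps = ordered.loc[is_gap, ["Asset_ID", "timestamp"]].copy()
    gaps["gap_seconds"] = step[is_gap]
    return gaps.reset_index(drop=True)


# Toy frame: asset 0 is missing the rows at 180 and 240 (made-up values).
toy_df = pd.DataFrame({
    "timestamp": [60, 120, 300, 60, 120, 180],
    "Asset_ID": [0, 0, 0, 1, 1, 1],
})

gaps = find_time_gaps(toy_df)
print(gaps)
```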

As expected, both files have the same format. However, a major open question is how these files will actually interact during the evaluation period. For example, if the supplemental training data is meant to be used for training, as the name implies, the final notebook will need to perform both that training and the evaluation within the time limit. In that case, there would also be no way to manually compare or choose between a model trained with the supplemental data and one trained only on the original data.

For simplicity, it will probably be best to use the supplemental training data only as the starting input to a time series model, rather than also training on it.
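
A minimal sketch of that idea: seed a per-asset history buffer from the supplemental data, then roll it forward as each test batch arrives. The helper `update_history`, the window length, and the toy values are hypothetical, and the column names (`timestamp`, `Asset_ID`, `Close`) are assumed.

```python
import pandas as pd


def update_history(history_df, new_batch_df, max_rows_per_asset):
    """Append a new batch and keep only the most recent rows for each asset."""
    combined = pd.concat([history_df, new_batch_df], ignore_index=True)
    combined = combined.sort_values(["Asset_ID", "timestamp"])
    return combined.groupby("Asset_ID").tail(max_rows_per_asset).reset_index(drop=True)


# Toy stand-in for the supplemental training data (made-up values).
toy_supplemental_df = pd.DataFrame({
    "timestamp": [60, 120, 180, 60, 120, 180],
    "Asset_ID": [0, 0, 0, 1, 1, 1],
    "Close": [10.0, 10.5, 10.2, 20.0, 19.5, 19.8],
})

# Seed the buffer with the last two rows of each asset.
history = toy_supplemental_df.groupby("Asset_ID").tail(2).reset_index(drop=True)

# A new test batch arrives; the oldest row of each asset drops out of the buffer.
new_batch = pd.DataFrame({
    "timestamp": [240, 240],
    "Asset_ID": [0, 1],
    "Close": [10.4, 20.1],
})
history = update_history(history, new_batch, max_rows_per_asset=2)
print(history)
```

Features that depend on recent prices, such as lagged returns, could then be computed from `history` at each step without retraining the model.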