# Time-Series Reconstruction
After verifying the Kaggle community's selection of important features, I will proceed to reconstruct the time-series dataset. Reconstruction will be based on this public kernel: https://www.kaggle.com/johnfarrell/breaking-lb-fresh-start-with-lag-selection

In [1]:
# Load libraries
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import pdb
import os
import h5py
import pickle

from sklearn.metrics import mean_squared_error

## Helper Functions:

In [2]:
# Function for loading h5py file
def load_h5py(fname):
    with h5py.File(fname, 'r') as handle:
        return handle['data'][:]
# Function for loading pickle file
def load_pickle(fname):
    with open(fname, 'rb') as handle:
        return pickle.load(handle)

In [3]:
# Function for setting up
def get_input(debug=False):
    '''
    Load either the debug or the full datasets.
    NOTE: the relative paths assume the working directory is scripts/time_series/.
    '''
    os.chdir('../../data/compressed/')
    print(os.getcwd())
    pkl_files = ['train_id.pickle', 'trainidx.pickle', 'target.pickle', 'test_id.pickle', 'testidx.pickle']
    # File prefix and log label depend on the debug flag
    prefix, label = ('debug', 'debug') if debug else ('full', 'original')
    print('Loading {} train and test datasets...'.format(label))
    # h5py files
    train = load_h5py('{}_train.h5'.format(prefix))
    test = load_h5py('{}_test.h5'.format(prefix))
    # pickle files
    id_train, train_idx, target, id_test, test_idx = [load_pickle('{}_{}'.format(prefix, f)) for f in pkl_files]
    # Load feature names
    fnames = load_pickle('feature_names.pickle')
    # Find shape of loaded datasets
    print('Shape of training dataset: {} Rows, {} Columns'.format(*train.shape))
    print('Shape of test dataset: {} Rows, {} Columns'.format(*test.shape))
    # Return to the scripts directory
    os.chdir('../../scripts/time_series/')
    print(os.getcwd())
    return fnames, train, id_train, train_idx, target, test, id_test, test_idx

In [4]:
# Function for getting datasets in dataframe format
def get_dataframes(debug=False):
    # Load data
    fnames, train, id_train, train_idx, target, test, id_test, test_idx = get_input(debug)
    # Format data
    train_df = pd.DataFrame(data=train, index=train_idx, columns=fnames)
    train_df['ID'] = id_train
    train_df['target'] = target
    test_df = pd.DataFrame(data=test, index=test_idx, columns=fnames)
    test_df['ID'] = id_test
    test_df['target'] = train_df['target'].mean()
    
    print('\nShape of training dataframe: {} Rows, {} Columns'.format(*train_df.shape))
    print('Shape of test dataframe: {} Rows, {} Columns'.format(*test_df.shape))
    return fnames, train_df, test_df

In [5]:
# Function for getting predictions with certain lag assumption
def _get_leak(df, cols, lag=0):
    # Key 1: each row's first len(cols)-lag-2 columns, packed into a tuple
    d1 = df[cols[:-lag-2]].apply(tuple, axis=1).to_frame().rename(columns={0: 'key'})
    # Key 2: each row's last len(cols)-lag-2 columns, packed into a tuple
    d2 = df[cols[lag+2:]].apply(tuple, axis=1).to_frame().rename(columns={0: 'key'})
    # Candidate target: the value 2 columns before the start of key 2
    d2['pred'] = df[cols[lag]]
    # Drop every key that occurs more than once, since it cannot identify a unique row
    d3 = d2[~d2.duplicated(subset=['key'], keep=False)]
    # Left-join d1 with d3 on 'key' and return 'pred' (0 where no match was found)
    return d1.merge(d3, how='left', on='key')['pred'].fillna(0)

In [6]:
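# Function for rewriting compiled_leak using only the lags below the best lag value
def rewrite_compiled_leak(leak_df, lag):
    # Reset compiled_leak field
    leak_df['compiled_leak'] = 0
    # range(lag) covers leaked_target_0 .. leaked_target_{lag-1}; the column for
    # `lag` itself is excluded, consistent with scores[lag], which is computed
    # before the values found at that lag are taken into account
    for i in range(lag):
        c = 'leaked_target_%s'%str(i)
        # Fill only rows that no shorter lag has filled yet
        zeroleak = leak_df['compiled_leak']==0
        leak_df.loc[zeroleak, 'compiled_leak'] = leak_df.loc[zeroleak, c]
    return leak_df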

## Main Script:

In [7]:
try:
    del fnames, train, test
    print('Clearing loaded dataframes from memory...\n')
except NameError:
    # Nothing has been loaded yet
    pass
fnames, train, test = get_dataframes(debug=False)

/Users/cheng-haotai/Projects_Data/santander-value-prediction/data/compressed
Loading original train and test datasets...
Shape of training dataset: 4459 Rows, 4991 Columns
Shape of test dataset: 49342 Rows, 4991 Columns
/Users/cheng-haotai/Projects_Data/santander-value-prediction/scripts/time_series

Shape of training dataframe: 4459 Rows, 4993 Columns
Shape of test dataframe: 49342 Rows, 4993 Columns


In [8]:
# Load important features
cols = load_pickle('./important.pickle')

In [9]:
# Format target
y = np.log1p(train['target']).values
log_mean = y.mean()

### Explanation of Leak Compilation:
**IMPORTANT:** This entire leak compilation procedure heavily relies upon the candidate time-series features being ordered such that they are in the proper time-series format.

The Kaggle community uses the term **leak** for the ability to forecast the target value from the time-series nature of the Santander dataset. Since the columns and rows of the Santander dataset are scrambled time-series features, it becomes possible to predict the target variable simply by setting it equal to the value of another time-series feature. The candidate time-series features for this replacement operation have been identified by the Kaggle community and subsequently partially validated by me in a separate analysis.

The nature of this leak is demonstrated in one of the Kaggle community's public kernels: https://www.kaggle.com/johnfarrell/giba-s-property-extended-extended-result. 

In this public kernel, it can be seen that (given that the important features are ordered in the proper time-series fashion) the target variable is simply the generated time-series at a **lag of 2**. As such, it can be assumed that this **2** lag value is an important artifact of the dataset that can be used for making predictions of the target variable.

This lag value is accounted for in the **`_get_leak`** function. Given the ordered time-series columns and a `lag` argument, the function builds two tuples per row, one from the leading columns and one from the trailing columns (each leaving out `lag + 2` columns at the opposite end), and then compares the two sets of tuples for matches. The base offset of 2 is built into the function, so the default `lag=0` corresponds to the lag of 2 shown in the public kernel linked above. The general motivation behind this algorithm is that the target variable is equal to values within the important features, but at a time offset.

Assuming that the ordering produced by the Kaggle community is correct, snipping each sample at both ends and searching for matches through a join operation is essentially the same as checking whether a sample is part of a longer sequence. If it is, we can apply the foundational assumption that the target variable is simply a temporal offset of the time series, and set the target to the value located 2 time steps before the beginning of the sequence.
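The matching step can be illustrated on a toy series. The sketch below is not the notebook's `_get_leak` itself but a simplified restatement of it. The series values, the column names `c0` to `c4`, and the helper name `get_leak_sketch` are all made up for illustration.

```python
import pandas as pd

def get_leak_sketch(df, cols, lag=0):
    """Simplified restatement of the tuple-matching step (illustrative only)."""
    n_key = len(cols) - lag - 2
    # Key 1: the first n_key columns of every row
    left = pd.DataFrame({'key': [tuple(r) for r in df[cols[:n_key]].values.tolist()]})
    # Key 2: the last n_key columns of every row, plus the candidate target value
    right = pd.DataFrame({'key': [tuple(r) for r in df[cols[lag + 2:]].values.tolist()],
                          'pred': df[cols[lag]].values})
    # A key that occurs more than once cannot identify a unique row, so drop it
    right = right[~right.duplicated(subset=['key'], keep=False)]
    return left.merge(right, how='left', on='key')['pred'].fillna(0)

# Hypothetical hidden series s[t] = 11 + t. The row observed at time t holds
# target = s[t] and columns c0..c4 = s[t-2], s[t-3], ..., s[t-6].
cols = ['c0', 'c1', 'c2', 'c3', 'c4']
rows = []
for t in (6, 7, 9):
    row = {'target': 11 + t}
    for j, c in enumerate(cols):
        row[c] = 11 + t - 2 - j
    rows.append(row)
toy = pd.DataFrame(rows)

leak_lag0 = get_leak_sketch(toy, cols, lag=0)
leak_lag1 = get_leak_sketch(toy, cols, lag=1)
print(leak_lag0.tolist())
print(leak_lag1.tolist())
```

In this toy data the row at `t=7` finds its target (18) at `lag=0` through the row at `t=9`, while the row at `t=6` has no partner two steps later and only recovers its target (17) at `lag=1`. Rows without any match stay at 0.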

This process is repeated for 38 lag settings (the number of important columns minus 2): the `lag` argument runs from 0 to 37, which corresponds to offsets of 2 through 39. Different lag values are needed because these 40 important features may not be enough to fully represent the time series when extended to the entire dataset. (Keep in mind that the tables shown in the public kernel linked above cover only a few select samples out of the entire dataset.)
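As a small arithmetic check of how the lag setting trades off against the length of the matched sequence, the sketch below assumes 40 important columns, as stated in the text. The helper name `key_length` is hypothetical.

```python
def key_length(n_cols, lag):
    """Number of columns in each matching tuple for a given lag argument."""
    return n_cols - lag - 2

n_cols = 40                 # number of important features stated in the text
max_nlags = n_cols - 2      # mirrors max_nlags = len(cols) - 2 in the notebook
for lag in (0, 1, max_nlags - 1):
    # offset = number of columns by which the two matched rows are shifted
    print('lag={:2d}  offset={:2d}  key length={:2d}'.format(
        lag, lag + 2, key_length(n_cols, lag)))
```

At the largest lag setting the key shrinks to a single column, which is why matches found at high lags are far less specific than those found at low lags.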

When extrapolated to the entire dataset, these 40 important features most likely have discontinuous time steps, which indicates incomplete time sequences. It is therefore important to also consider shorter segments when creating matches, so that these incomplete sequences can still have predictive value. That said, longer sequences (shorter lag) are inherently more reliable, so their predictions are prioritized over those from shorter sequences (longer lag).

As such, the leak compilation procedure aims to make predictions on the target variable by searching for sequence matches that can be used to leverage the **2-lag** behavior of the Santander dataset. Longer sequence matches are given higher priority than shorter sequence matches. 
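The priority rule can be sketched as follows: a row's compiled value is taken from the lowest lag that produced a match, and later lags only fill rows that are still empty. The per-lag values below are invented for illustration and are not taken from the Santander data.

```python
import pandas as pd

# Hypothetical per-lag predictions for four rows; 0 means "no match found"
leaks = pd.DataFrame({
    'leaked_target_0': [500.0, 0.0, 0.0, 0.0],
    'leaked_target_1': [999.0, 300.0, 0.0, 0.0],
    'leaked_target_2': [0.0, 0.0, 200.0, 0.0],
})

compiled = pd.Series(0.0, index=leaks.index)
for i in range(leaks.shape[1]):
    # Only rows that no shorter lag has filled are eligible
    still_empty = compiled == 0
    compiled.loc[still_empty] = leaks.loc[still_empty, 'leaked_target_%d' % i]

print(compiled.tolist())
```

The first row keeps the value found at lag 0 even though lag 1 proposes a different one, and the last row stays at 0 because no lag produced a match.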

### Leak Compilation for Training Set:

In [10]:
extra_cols = ['compiled_leak', 'nonzero_mean']

# Function for compiling leak results over many lag values
def compiled_leak_result():
    # Define number of lag values to consider
    max_nlags = len(cols)-2
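    # Define leaky train set (copy so that new columns are not written to a view of train)
    train_leak = train[['ID', 'target'] + list(cols)].copy()
    # Initialize compiled_leak as zeros
    train_leak['compiled_leak'] = 0
    # Fallback prediction: mean of each row's nonzero features, taken in log1p space
    train_leak['nonzero_mean'] = train[fnames].apply(lambda x: np.expm1(np.log1p(x[x!=0]).mean()), axis=1)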
    # Initialize empty lists
    scores = []
    leaky_value_counts = []
    leaky_value_corrects = []
    leaky_cols = []
    
    for i in range(max_nlags):
        c = 'leaked_target_%s'%str(i)
        
        print('\nProcessing Lag: {}'.format(i))
        # Get predictions for current lag and store in new column
        train_leak[c] = _get_leak(train_leak, cols, i)
        # Update leaky_cols with latest lag label
        leaky_cols.append(c)
        # Get "grounding" by joining with original training dataset
        train_leak = train.join(train_leak.set_index('ID')[leaky_cols + extra_cols], 
                                on='ID', how='left')[['ID', 'target'] + list(cols) + leaky_cols + extra_cols]
        # Iteratively fill in compiled_leak values for increasing lag
        zeroleak = train_leak['compiled_leak'] == 0
        train_leak.loc[zeroleak, 'compiled_leak'] = train_leak.loc[zeroleak, c]
        
        # Number of leaky values found so far
        leaky_value_counts.append(np.sum(train_leak['compiled_leak']>0))
        # Number of discovered leaky values that equal the true target
        _correct_counts = np.sum(train_leak['compiled_leak']==train_leak['target'])
        # Fraction (not percent) of discovered leaky values that are correct
        leaky_value_corrects.append(1.0*_correct_counts/leaky_value_counts[-1])
        
        print('Number of leak values found in train: {}'.format(leaky_value_counts[-1]))
        print('Percentage of correct leak values in train: {}'.format(leaky_value_corrects[-1]))
        
        # Find score of current compilation iteration
        # NOTE: zeroleak was computed before this lag's fill, so values found at
        # lag i are also replaced by nonzero_mean here; scores[i] therefore
        # evaluates the leaks from lags 0..i-1 only
        tmp = train_leak.copy()  # Temporary dataframe
        tmp.loc[zeroleak, 'compiled_leak'] = tmp.loc[zeroleak, 'nonzero_mean']
        scores.append(np.sqrt(mean_squared_error(y, np.log1p(tmp['compiled_leak']).fillna(log_mean))))
        
        print('Score (filled with nonzero mean): {}'.format(scores[-1]))
    
    # End of iterations
    result = dict(score=scores,
                  leaky_count = leaky_value_counts,
                  leaky_correct = leaky_value_corrects)
    
    return train_leak, result

In [11]:
# Get leaked training data and result
train_leak, result = compiled_leak_result()


Processing Lag: 0
Number of leak values found in train: 1351
Percentage of correct leak values in train: 0.9955588453
Score (filled with nonzero mean): 1.5138333391635188

Processing Lag: 1
Number of leak values found in train: 1947
Percentage of correct leak values in train: 0.996404725218
Score (filled with nonzero mean): 1.2922048129527162

Processing Lag: 2
Number of leak values found in train: 2340
Percentage of correct leak values in train: 0.99358974359
Score (filled with nonzero mean): 1.1732829046778304

Processing Lag: 3
Number of leak values found in train: 2586
Percentage of correct leak values in train: 0.993039443155
Score (filled with nonzero mean): 1.084326373672634

Processing Lag: 4
Number of leak values found in train: 2754
Percentage of correct leak values in train: 0.993464052288
Score (filled with nonzero mean): 1.0327870440015579

Processing Lag: 5
Number of leak values found in train: 2899
Percentage of correct leak values in train: 0.993101069334
Score (filled

In [12]:
# Format results 
result = pd.DataFrame.from_dict(result, orient='columns')
result.T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| leaky_correct | 0.995559 | 0.996405 | 0.99359 | 0.993039 | 0.993464 | 0.993101 | 0.992369 | 0.991961 | 0.991844 | 0.991659 | ... | 0.985412 | 0.984603 | 0.98326 | 0.982451 | 0.981644 | 0.979508 | 0.979246 | 0.978979 | 0.978178 | 0.978178 |
| leaky_count | 1351.0 | 1947.0 | 2340.0 | 2586.0 | 2754.0 | 2899.0 | 3014.0 | 3110.0 | 3188.0 | 3237.0 | ... | 3633.0 | 3637.0 | 3644.0 | 3647.0 | 3650.0 | 3660.0 | 3662.0 | 3663.0 | 3666.0 | 3666.0 |
| score | 1.513833 | 1.292205 | 1.173283 | 1.084326 | 1.032787 | 0.994032 | 0.947572 | 0.906143 | 0.882938 | 0.864543 | ... | 0.737684 | 0.737191 | 0.739431 | 0.738832 | 0.739338 | 0.740242 | 0.739891 | 0.740415 | 0.743261 | 0.742745 |


In [13]:
# Save results
train_res_name = './stats/train_leaky_stat.csv'
result.to_csv(train_res_name, index=False)

In [14]:
# Find best score and lag value
best_score = result['score'].min()
best_lag = int(result['score'].idxmin())  # index label of the lowest score
print('Best score: {}'.format(best_score))
print('Best lag value: {}'.format(best_lag))

Best score: 0.7371911838169722
Best lag value: 29
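
The score reported here is a root-mean-squared error between `log1p(target)` and `log1p(prediction)`, where rows without a leak value fall back to the row's nonzero mean. The sketch below reproduces that calculation on made-up numbers. The helper names `nonzero_log_mean` and `rmsle` and all values are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def nonzero_log_mean(row):
    """Mean of the nonzero entries taken in log1p space, mapped back with expm1."""
    row = np.asarray(row, dtype=float)
    nonzero = row[row != 0]
    return np.expm1(np.log1p(nonzero).mean())

def rmsle(y_true, y_pred):
    """Root-mean-squared error between log1p(y_true) and log1p(y_pred)."""
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

target = np.array([100.0, 200.0])
leak = np.array([100.0, 0.0])   # the second row has no leak match
fallback = np.array([nonzero_log_mean([0, 99, 99]),
                     nonzero_log_mean([0, 0, 399])])

# Use the leak value where one exists, otherwise the nonzero mean
pred = np.where(leak == 0, fallback, leak)
score = rmsle(target, pred)
print(score)
```

Only the second row contributes to the error here, since the first row's leak value equals its target exactly.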


In [15]:
# Rewrite leaky training set in terms of best lag
leaky_cols = [c for c in train_leak.columns if 'leaked_target_' in c]
train_leak = rewrite_compiled_leak(train_leak, best_lag)

train_leak_name = './stats/train_leak.csv'
# Save train_leak
train_res = train_leak[leaky_cols + ['compiled_leak']].replace(0.0, np.nan)
train_res.to_csv(train_leak_name, index=False)

### Leak Compilation for Test Set:

In [16]:
# Function for compiling leaky values for test set
def compiled_leak_result_test():
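    # Number of lag values to consider
    max_nlags = len(cols)-2
    
    # Leaky test set (copy so that new columns are not written to a view of test)
    test_leak = test[['ID', 'target'] + list(cols)].copy()
    test_leak['compiled_leak'] = 0
    # Fallback prediction: mean of each row's nonzero features, taken in log1p space
    test_leak['nonzero_mean'] = test[fnames].apply(lambda x: np.expm1(np.log1p(x[x!=0]).mean()), axis=1)
    
    leaky_value_counts = []
    leaky_cols = []
    
    for i in range(max_nlags):
        c = 'leaked_target_%s'%str(i)
        
        print('\nProcessing Lag: {}'.format(i))
        # Get predictions for current lag and store in new column
        test_leak[c] = _get_leak(test_leak, cols, i)
        leaky_cols.append(c)
        
        # Join back onto the original test set
        test_leak = test.join(test_leak.set_index('ID')[leaky_cols + extra_cols], 
                              on='ID', how='left')[['ID', 'target'] + list(cols) + leaky_cols + extra_cols]
        # Fill compiled_leak only where no shorter lag has produced a value
        zeroleak = test_leak['compiled_leak']==0
        test_leak.loc[zeroleak, 'compiled_leak'] = test_leak.loc[zeroleak, c]
        # Number of leaky values found so far
        leaky_value_counts.append(np.sum(test_leak['compiled_leak']>0))
        
        print('Number of leaky values found in test: {}'.format(leaky_value_counts[-1]))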
        
    # End iterations
    result = dict(leaky_count = leaky_value_counts)
    
    return test_leak, result

In [17]:
# Get leaked test data and result
test_leak, test_result = compiled_leak_result_test()


Processing Lag: 0
Number of leaky values found in test: 2963

Processing Lag: 1
Number of leaky values found in test: 4215

Processing Lag: 2
Number of leaky values found in test: 4960

Processing Lag: 3
Number of leaky values found in test: 5503

Processing Lag: 4
Number of leaky values found in test: 5917

Processing Lag: 5
Number of leaky values found in test: 6208

Processing Lag: 6
Number of leaky values found in test: 6426

Processing Lag: 7
Number of leaky values found in test: 6583

Processing Lag: 8
Number of leaky values found in test: 6742

Processing Lag: 9
Number of leaky values found in test: 6872

Processing Lag: 10
Number of leaky values found in test: 6983

Processing Lag: 11
Number of leaky values found in test: 7080

Processing Lag: 12
Number of leaky values found in test: 7159

Processing Lag: 13
Number of leaky values found in test: 7243

Processing Lag: 14
Number of leaky values found in test: 7303

Processing Lag: 15
Number of leaky values found in test: 7366

P

In [18]:
# Format test results
test_result = pd.DataFrame.from_dict(test_result, orient='columns')
test_result.T

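| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| leaky_count | 2963 | 4215 | 4960 | 5503 | 5917 | 6208 | 6426 | 6583 | 6742 | 6872 | ... | 7835 | 7887 | 7949 | 8025 | 8117 | 8277 | 8478 | 8685 | 9012 | 9501 |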


In [19]:
# Save test results
test_res_name  = './stats/test_leaky_stat.csv'
test_result.to_csv(test_res_name, index=False)

In [20]:
# Rewrite leaky test set in terms of best lag
test_leak = rewrite_compiled_leak(test_leak, best_lag)

test_leak_name = './stats/test_leak.csv'
# Save test_leak
test_res = test_leak[leaky_cols + ['compiled_leak']].replace(0.0, np.nan)
test_res.to_csv(test_leak_name, index=False)

### Make Submission:

In [21]:
# Replace zeros in compiled_leak field with the row's nonzero mean
zeroleak = test_leak['compiled_leak']==0
test_leak.loc[zeroleak, 'compiled_leak'] = test_leak.loc[zeroleak, 'nonzero_mean']

In [22]:
submit_name = '../../submissions/recon_0_lag%s_submit.csv'%best_lag
# Make and save submission
sub = pd.DataFrame()
sub['ID'] = test['ID']
sub['target'] = test_leak['compiled_leak']
sub.to_csv(submit_name, index=False)

In [23]:
sub.head()

| | ID | target |
|---|---|---|
| 0 | 000137c73 | 2209645.0 |
| 1 | 00021489f | 1099644.0 |
| 2 | 0004d7953 | 1276252.0 |
| 3 | 00056a333 | 7871320.0 |
| 4 | 00056d8eb | 2868883.0 |
