# DATA 1030 Midterm Project Notebook - Part I, Creating Target Variables
* This notebook converts the raw target columns (originally the time, in seconds, that each heuristic took to prove a theorem) into binary labels and exports the resulting dataset for further use
* Features and descriptions are also listed

In [1]:
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pylab as plt

### 1. Manually create feature names

* The dataset did not come with a header row, so I had to construct one
* The feature names and the five heuristics are taken from the original paper (Bridge et al.)
* The column labels created below cover the 14 static features S1 through S14 (columns 0-13), the 39 dynamic features D1 through D39 (columns 14-52), and the five time measurements H1_Time through H5_Time (columns 53-57), one per heuristic; each row corresponds to a specific problem

In [2]:
columns_type = ['S']*14 + ['D']*39
indices = list(range(1, 15)) + list(range(1, 40))
H_Time = ['H1_Time', 'H2_Time', 'H3_Time', 'H4_Time','H5_Time']
type_join_number = [columns_type[i] + str(indices[i]) for i in range(0,53)] 
columns = type_join_number + H_Time
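
A quick sanity check on the generated header can catch off-by-one mistakes early. The sketch below rebuilds the same names in a more compact form (the variable names `static_names`, `dynamic_names`, `time_names`, and `header` are introduced here for illustration only) and confirms the expected count and boundaries.

```python
# Rebuild the header the same way as the cell above, in a slightly more compact form.
static_names = [f"S{i}" for i in range(1, 15)]    # S1 .. S14
dynamic_names = [f"D{i}" for i in range(1, 40)]   # D1 .. D39
time_names = [f"H{i}_Time" for i in range(1, 6)]  # H1_Time .. H5_Time
header = static_names + dynamic_names + time_names

# 14 static + 39 dynamic + 5 timing columns = 58 names, all of them unique.
assert len(header) == 58
assert len(set(header)) == len(header)

# Boundaries between the three groups of columns.
print(header[0], header[13], header[14], header[52], header[53], header[-1])
```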

### 2. Read in the data and files for feature descriptions

In [3]:
df = pd.read_csv('../data/all-data-raw.csv', header=None, names=columns)

# Scraped PDF table and converted to CSV to retrieve the features from two tables in the paper
static_featnames = pd.read_csv('../data/static-features.csv') 
dynamic_featnames = pd.read_csv('../data/dynamic-features.csv')

full_features = pd.concat([static_featnames, dynamic_featnames], ignore_index=True)
full_features.drop('Feature number', axis= 1, inplace=True)
col_as_df = pd.DataFrame(columns, columns=['Feature Type and Number'])
full_features = col_as_df.merge(full_features, left_index = True, right_index=True)
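
Because the merge above pairs rows purely by index position, it helps to remember that `merge` defaults to an inner join: labels without a matching description row are dropped. The toy frames below are hypothetical stand-ins, assuming the description files hold one row per feature and no rows for the timing columns.

```python
import pandas as pd

# Hypothetical data: 4 column labels but only 3 descriptions,
# mirroring a label list that is longer than the description table.
labels = pd.DataFrame(["S1", "S2", "D1", "H1_Time"],
                      columns=["Feature Type and Number"])
descriptions = pd.DataFrame({"Description": ["first", "second", "third"]})

# how="inner" is the default, so only index values present in both frames survive.
paired = labels.merge(descriptions, left_index=True, right_index=True)
print(paired)
```

In this sketch the `H1_Time` row disappears from `paired`, which is the desired outcome when only the features should carry descriptions.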

### 3. Features and Descriptions

In [4]:
full_features

| | Feature Type and Number | Description |
|---|---|---|
| 0 | S1 | Fraction of clauses that are unit clauses |
| 1 | S2 | Fraction of clauses that are Horn clauses |
| 2 | S3 | Fraction of clauses that are ground Clauses |
| 3 | S4 | Fraction of clauses that are demodulators |
| 4 | S5 | Fraction of clauses that are rewrite rules (or... |
| 5 | S6 | Fraction of clauses that are purely positive |
| 6 | S7 | Fraction of clauses that are purely negative |
| 7 | S8 | Fraction of clauses that are mixed positive an... |
| 8 | S9 | Maximum clause length |
| 9 | S10 | Average clause length |


### 4. Conversion of `H_Time` columns to target columns with binary classification labels
* The initial target columns are not labels, but rather the time taken (in seconds) for each heuristic to prove the theorem/conjecture.
* There was a time limit of 100 seconds.
* An entry of -100 denotes failure to obtain a proof within the time limit.
* Using a numpy array instead of a pandas DataFrame made calculating the min and argmin values simpler (in my opinion).
* `numpy.ma.MaskedArray` makes it possible to mask the -100 values and compute the "true" minimum when one heuristic succeeded while one or more of the others failed to prove the theorem in the allotted time.
* The target columns were converted to six 0/1 columns and also to a single column labeled 0-5.
* The final target column is the single 0-5 labeled column.
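
To illustrate why the masking matters, here is a minimal sketch on made-up timings (the `times` array is hypothetical, not taken from the dataset). A plain argmin is drawn to the -100 sentinel, while the masked version ignores it.

```python
import numpy as np

# Hypothetical timings for three problems and five heuristics; -100 marks a timeout.
times = np.array([
    [-100.0,    2.5, -100.0,    0.7,   10.0],   # H4 is fastest
    [-100.0, -100.0, -100.0, -100.0, -100.0],   # nothing succeeded
    [   1.2, -100.0,    3.4, -100.0, -100.0],   # H1 is fastest
])

# A plain argmin is fooled by the sentinel and picks a heuristic that timed out.
naive = int(np.argmin(times[0]))

# Masking the negative entries leaves only real timings in the comparison.
masked = np.ma.MaskedArray(times, mask=times < 0)
fastest = int(masked[0].argmin())

print(naive, fastest)   # column 0 (a timeout) versus column 3 (H4)

# A fully masked row has no valid minimum, so it has to be detected separately.
print(bool(masked[1].mask.all()))
```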

In [5]:
df_heuristics_array = df[H_Time].values
H = np.zeros(shape=(df.shape[0], 6), dtype=np.int8)
y = np.zeros(shape=(df.shape[0],1), dtype=np.int8)
validate = ['']*df.shape[0] # To cross-reference with the author's data

In [6]:
for index in range(df.shape[0]):
    row = df_heuristics_array[index]
    # If every HX_Time value in the row is -100.0, then H0 (decline proof) is the best heuristic (col 0 in H)
    if np.all(row == -100.0):
        H[index][0] = 1
        validate[index] = 'H0_Time' # validate list for confirming transformed results
    else:
        # Mask values < 0 (i.e. all -100.0 values) and determine fastest heuristic with positive minimum time
        best_nonH0 = np.ma.MaskedArray(row, row < 0)  # mask only the current row, not the whole array
        argmin = np.argmin(best_nonH0)
        H[index][argmin+1] = 1  # columns 1-5 of H correspond to H1-H5
        y[index] = argmin + 1
        validate[index] = H_Time[argmin]
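
The same labeling can also be expressed without an explicit Python loop. The following is only a sketch under the conventions stated above (negative entries mean failure, label 0 means every heuristic failed); the helper name `label_best_heuristic` and the `demo` array are hypothetical and not part of the notebook.

```python
import numpy as np

def label_best_heuristic(times):
    """Return 0 where every heuristic timed out, else 1 + index of the fastest one."""
    times = np.asarray(times, dtype=float)
    failed = times < 0                                   # True wherever the -100 sentinel appears
    # Replace failures with +inf so they can never win the argmin.
    fastest = np.where(failed, np.inf, times).argmin(axis=1)
    return np.where(failed.all(axis=1), 0, fastest + 1).astype(np.int8)

# Hypothetical timings, not rows from the real dataset.
demo = np.array([
    [-100.0,    2.5, -100.0,    0.7,   10.0],
    [-100.0, -100.0, -100.0, -100.0, -100.0],
    [   1.2, -100.0,    3.4, -100.0, -100.0],
])
labels = label_best_heuristic(demo)

# One-hot layout matching H: column k is 1 when heuristic Hk is best.
one_hot = np.eye(6, dtype=np.int8)[labels]
print(labels)
print(one_hot)
```

Substituting `np.inf` for the failures plays the same role as the mask in the loop; the loop version may still be easier to follow when the `validate` strings are needed at the same time.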

### 5. Validate conversions based on training data from author
* Before converting to the single-column target variable, I checked that my conversions agree with the known values in the author-derived training data.
* My data uses 0 for a non-optimal heuristic and 1 for an optimal one. The data can still be validated easily by comparing the strings that name each heuristic: in the cell above, I filled a list called `validate` with the `H_Time` label retrieved using the calculated argmin. Printing these strings also makes the check easy to read, since the output names the heuristic directly and no indices need to be converted in your head.
* The author removed two of the features (static feature 5 and dynamic feature 21) from his data, so I removed them from the `train_columns` names as well so that the column labels line up.
* Both features were removed because each one takes the same value for all data points, as can be seen in the EDA of single columns.
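
The string-based comparison described above can be sketched on a toy frame. The values below are hypothetical, and the sketch assumes only that the optimal heuristic carries the largest value in its row (here 1 against -1), whatever encoding the author's file actually uses for non-optimal entries.

```python
import pandas as pd

# Hypothetical targets in the column order used for the author's training file.
cols = ["H1_Time", "H2_Time", "H3_Time", "H4_Time", "H5_Time", "H0_Time"]
author_targets = pd.DataFrame(
    [[-1, -1, -1,  1, -1, -1],
     [-1, -1, -1, -1, -1,  1],
     [ 1, -1, -1, -1, -1, -1]],
    columns=cols,
)

# idxmax along the columns returns, per row, the name of the column holding the largest value.
names = author_targets.idxmax(axis=1).tolist()

# Stand-in for the `validate` list built during the conversion.
mine = ["H4_Time", "H0_Time", "H1_Time"]
mismatches = [i for i, (a, b) in enumerate(zip(names, mine)) if a != b]

print(names)
print(mismatches)   # an empty list means every row agrees
```

Collecting the mismatching indices, rather than stopping at the first failed assertion, can make it easier to spot a systematic offset if something does go wrong.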

In [7]:
train_H_Time = ['H1_Time', 'H2_Time', 'H3_Time', 'H4_Time','H5_Time', 'H0_Time']
train_columns = [columns_type[i] + str(indices[i]) for i in range(0,53)] + train_H_Time 
train_columns.remove('S5')
train_columns.remove('D21')

df_train = pd.read_csv('../data/bridge-data/train.csv', header=None, names=train_columns)
df_H = df_train[train_H_Time]
train_validate = ['']*df_H.shape[0]

for i in range(df_H.shape[0]):
    train_validate[i] = df_H.iloc[i].idxmax()  # each row is a Series, so no axis argument is needed
    assert train_validate[i] == validate[i], 'something is wrong at index {}'.format(i)
print('All results validated') # print if all assertions pass

All results validated


### 6. Construct single column target variable as DataFrame
* Labels $ X \in \{ 0,1,2,3,4,5\} $ correspond to Heuristic X being optimal for a specific row/data point
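
Since each row of the six-column encoding contains exactly one 1, the single label is just the position of that 1. A minimal sketch with a hypothetical array `H_demo` (columns ordered H0 through H5) shows the correspondence:

```python
import numpy as np

# Hypothetical one-hot block with columns ordered H0 .. H5.
H_demo = np.array([
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
], dtype=np.int8)

# Exactly one 1 per row, so argmax recovers the column index, i.e. the label X.
assert (H_demo.sum(axis=1) == 1).all()
y_demo = H_demo.argmax(axis=1)
print(y_demo.tolist())
```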

In [8]:
target_multiple = pd.DataFrame(H, columns=['H0_Best','H1_Best', 'H2_Best', 'H3_Best', 'H4_Best', 'H5_Best'])
target_single = pd.DataFrame(y, columns=['Best Heuristic'])
df_target = target_multiple.merge(target_single, left_index=True, right_index=True)
df_target.head()

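| | H0_Best | H1_Best | H2_Best | H3_Best | H4_Best | H5_Best | Best Heuristic |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |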


### 7. Replace the raw target columns with the transformed targets
* Merge on index

In [9]:
df_features = df.drop(H_Time, axis=1, inplace=False)
df_merged = df_features.merge(df_target, left_index=True, right_index=True)
df_merged.head()

| | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 | ... | D37 | D38 | D39 | H0_Best | H1_Best | H2_Best | H3_Best | H4_Best | H5_Best | Best Heuristic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.83307 | 0.99682 | 0.83307 | 0.76789 | 0 | 0.76948 | 0.069952 | 0.16057 | 6 | 1.2734 | ... | 0.73872 | 0.073308 | 0.18797 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.83307 | 0.99682 | 0.83307 | 0.76948 | 0 | 0.77107 | 0.068363 | 0.16057 | 6 | 1.2734 | ... | 0.74436 | 0.067669 | 0.18797 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0.83307 | 0.99682 | 0.83307 | 0.76789 | 0 | 0.76948 | 0.069952 | 0.16057 | 6 | 1.2734 | ... | 0.74248 | 0.069549 | 0.18797 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.83307 | 0.99682 | 0.83307 | 0.76789 | 0 | 0.76948 | 0.069952 | 0.16057 | 6 | 1.2734 | ... | 0.7312 | 0.080827 | 0.18797 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0.83307 | 0.99682 | 0.83307 | 0.76789 | 0 | 0.76948 | 0.069952 | 0.16057 | 6 | 1.2734 | ... | 0.73308 | 0.078947 | 0.18797 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |


### 8. Export data and merged features to CSV files

In [10]:
df_merged.to_csv('../data/raw_data_plus_labeled_targets.csv', index=False)
full_features.to_csv('../data/features_plus_descriptions.csv', index=False)
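
As a small illustration of why `index=False` is passed on export, the sketch below round-trips a hypothetical miniature frame through an in-memory buffer instead of a real file. Without `index=False`, the row index is written as an unnamed column and comes back as `Unnamed: 0` on reload.

```python
import io
import pandas as pd

# Hypothetical miniature of the merged frame.
small = pd.DataFrame({"S1": [0.83307, 0.5], "Best Heuristic": [0, 1]})

# Export without the row index, then read the text back.
buffer = io.StringIO()
small.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Export with the default index=True for comparison.
buffer_with_index = io.StringIO()
small.to_csv(buffer_with_index)
buffer_with_index.seek(0)
restored_with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))   # contains an extra 'Unnamed: 0' column
```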