<h1 align="center">Predicting target set using profiling set only</h1>

In [None]:
%matplotlib inline
import numpy as np  # np.log2 is used by the log-feature lambdas below
from inc.notebook005 import *

# One set of features for all applications

For each feature set and each application, we train the model on the profiling set using cross-validation and measure the RMSE on both the profiling validation set and the target (test) set. We then plot the results for two feature sets, both chosen from the profiling validation results: the first has the lowest mean RMSE across all applications, and the second has the lowest maximum RMSE.
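
The two selection criteria can be sketched as follows. This is only an illustration: `pick_feature_sets`, the dictionary layout, and the numbers are assumptions made for the example, not code or results from `inc.notebook005`.

```python
from statistics import mean

def pick_feature_sets(validation_rmse):
    """validation_rmse maps feature-set name -> {application: validation RMSE}.

    Returns (name with the lowest mean RMSE, name with the lowest maximum RMSE).
    """
    best_mean = min(validation_rmse,
                    key=lambda fs: mean(list(validation_rmse[fs].values())))
    best_max = min(validation_rmse,
                   key=lambda fs: max(validation_rmse[fs].values()))
    return best_mean, best_max

# Made-up numbers, for illustration only (not results from this notebook).
example = {
    'fs_a': {'app_1': 1.0, 'app_2': 9.0},  # mean 5.0, max 9.0
    'fs_b': {'app_1': 6.0, 'app_2': 6.0},  # mean 6.0, max 6.0
}
print(pick_feature_sets(example))
```

With these placeholder values the two criteria disagree: `fs_a` wins on the mean while `fs_b` wins on the maximum, which is why both are reported.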

In [None]:
predictor = Predictor()

## Best Mean RMSE: input/workers

### RMSE

In [None]:
features = [('input/workers', lambda df: df.input/df.workers)]
predictor.set_features(features)
predictor.print_rmse()

### Plots

In [None]:
plot_all(predictor)

## Best Max RMSE: log(input), log(workers), y = log(duration_ms)

### RMSE

In [None]:
predictor.use_log = True
features = [('log(input)', lambda df: np.log2(df.input)),
            ('log(workers)', lambda df: np.log2(df.workers))]
predictor.set_features(features)
predictor.print_rmse()

### Plots

In [None]:
plot_all(predictor)
predictor.use_log = False

## Other feature sets tested

### log(input/workers), y = log(duration_ms)

In [None]:
predictor.use_log = True
features = [('log(input/workers)', lambda df: np.log2(df.input/df.workers))]
predictor.set_features(features)
predictor.print_rmse()
predictor.use_log = False

### input/workers, input, workers

In [None]:
features = [('input', lambda df: df.input),
            ('workers', lambda df: df.workers),
            ('input/workers', lambda df: df.input/df.workers)]
predictor.set_features(features)
predictor.print_rmse()

### input, workers

In [None]:
features = [('input', lambda df: df.input),
            ('workers', lambda df: df.workers)]
predictor.set_features(features)
predictor.print_rmse()

### log(input)/log(workers), y = log(duration_ms)

In [None]:
predictor.use_log = True
# Ratio of the two logs; undefined when workers == 1 because log2(1) == 0
features = [('log(input)/log(workers)', lambda df: np.log2(df.input)/np.log2(df.workers))]
predictor.set_features(features)
predictor.print_rmse()
predictor.use_log = False

## Conclusion

The prediction using *input/workers* has high errors for the largest target input size of both HB Sort and HB K-means. In contrast, using *log(input)* and *log(workers)* to predict *log(duration)* gives high errors for the Wikipedia application, slightly lower errors for HB Sort, and significantly better results for HB K-means. None of the tested feature sets gives good results for all applications.
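
To make the log-based feature set concrete, the sketch below assumes the model is linear in its features, which the notebook does not show (the model lives inside `Predictor`). Under that assumption, predicting *log(duration)* from *log(input)* and *log(workers)* amounts to fitting a power law in the original units. The function names and the coefficients `b0`, `b1`, `b2` are hypothetical.

```python
import math

def predict_duration_ms(input_size, workers, b0, b1, b2):
    """Evaluate a linear model in log2 space, then map back to milliseconds."""
    log_duration = b0 + b1 * math.log2(input_size) + b2 * math.log2(workers)
    return 2 ** log_duration

def power_law_ms(input_size, workers, b0, b1, b2):
    """The same prediction written as a power law in the original units."""
    return (2 ** b0) * (input_size ** b1) * (workers ** b2)

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical coefficients, chosen only to show that both forms agree.
print(predict_duration_ms(1024, 4, b0=3, b1=1, b2=-1))
print(power_law_ms(1024, 4, b0=3, b1=1, b2=-1))
```

If RMSE is to be compared with the non-log feature sets, the predictions need to be mapped back to milliseconds first, as `predict_duration_ms` does; whether `Predictor` does this when `use_log` is set is not visible here.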

# Multiple feature sets for each application

Now, for each application, we choose the feature set that performs best on the profiling set (using cross-validation).
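
A minimal sketch of this per-application selection, together with a check of how the chosen feature set fares on the target set, is given here. The helper names, the dictionary layout, and the numbers are assumptions for illustration; the actual logic is in `evaluate_feature_sets`, which is not shown.

```python
def best_per_application(validation_rmse):
    """validation_rmse maps application -> {feature-set name: validation RMSE}.

    Returns the feature set with the lowest validation RMSE for each application.
    """
    return {app: min(scores, key=scores.get)
            for app, scores in validation_rmse.items()}

def target_gap(validation_rmse, target_rmse):
    """Target RMSE of the chosen feature set minus the best achievable target RMSE."""
    chosen = best_per_application(validation_rmse)
    return {app: target_rmse[app][fs] - min(target_rmse[app].values())
            for app, fs in chosen.items()}

# Made-up numbers, for illustration only (not results from this notebook).
validation = {'app_1': {'fs_a': 2.0, 'fs_b': 3.0}}
target = {'app_1': {'fs_a': 10.0, 'fs_b': 4.0}}
print(best_per_application(validation))
print(target_gap(validation, target))
```

A gap of zero means the feature set chosen on the profiling set is also the best one on the target set; a large gap means the profiling results were a poor guide for that application.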

In [None]:
feature_sets = (
    False,  # do not use log for makespan
    ('input/workers', lambda df: df.input/df.workers)
),(
    True,  # predict log(makespan)
    ('log(input)', lambda df: np.log2(df.input)),
    ('log(workers)', lambda df: np.log2(df.workers))
),(
    False,  # do not use log for makespan
    ('input', lambda df: df.input),
    ('workers', lambda df: df.workers)
),(
    False,  # do not use log for makespan
    ('input', lambda df: df.input),
    ('workers', lambda df: df.workers),
    ('input/workers', lambda df: df.input/df.workers)
),(
    True,  # predict log(makespan)
    ('log(input/workers)', lambda df: np.log2(df.input/df.workers))
),(
    True,  # predict log(makespan)
    # ratio of the two logs; undefined when workers == 1 because log2(1) == 0
    ('log(input)/log(workers)', lambda df: np.log2(df.input)/np.log2(df.workers))
)

## Results

In [None]:
evaluate_feature_sets(feature_sets)

## Conclusion

The best feature set for the profiling set is not necessarily the best one for the target set. For example, for HB K-means, the best feature set on the profiling set is *input*, *workers*, *input/workers*, but its RMSE when predicting the target set is very high (125.01 s).