        
# Stage 2: Real-world track

## Judging Criteria for the Real-World Track


For the real-world track, __[Dr. Maimuna Majumder](https://maimunamajumder.com/)__ judged the quality of the SR methods 
in terms of the models they produce to predict 2-week counts of cases, hospitalizations, and deaths in the COVID-19 pandemic.
This assessment, carried out by the expert, is subjective and based on their expertise.

In more detail, for any given real-world data set:

1. Each method was run 10 times, producing 10 models.
Each one of these 10 models is tested on the test set, and the model with median `accuracy` will be considered the representative model for the expert to consider.
For this model, `accuracy` and `simplicity`, as per the definitions above, will be reported to the expert for reference.

2. The expert will rank the so-obtained competing models in terms of their *trust* in them. 
Trust is a subjective measure decided by the expert.
We will only direct the expert to take into account the level of `accuracy` and `simplicity` to a reasonable extent.
The expert is free to interpret these measurements as they please.
The expert may, e.g., consider a subjective and not well-defined notion of "`soundness`" to be most important.
For example, for two models `m_1` and `m_2`, the expert may deem them to be equivalently accurate even though `accuracy_1 > accuracy_2`. Moreover, even if `m_1` has fewer components than `m_2`, the expert may decide that `m_2` is the better model because of the nature of its components and the way they are combined (e.g., `m_2` contains a realization of the Body Mass Index for a medical problem where the patient's `weight` and `height` are deemed important, while `m_1` contains unintuitive operations such as `atan(log(sqrt(age))/height)`).

3. Through discussions, it was determined that the best way to rank the methods was to incorporate all three scores, as in stage 1. The winning SR method is therefore the one whose model ranks 1st in terms of the harmonic mean of its ranks on the expert score, `r2_test`, and `simplicity`. These per-data-set scores are averaged across data sets to obtain the final winner.
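
The selection of the representative model in step 1 can be sketched as follows. This is a minimal illustration and not the notebook's own code: `pick_representative` and the toy accuracies are hypothetical. It keeps the run whose accuracy is closest to the median over all runs. With an even number of runs, the median falls midway between the two middle runs, so either may be chosen.

```python
from statistics import median

def pick_representative(runs):
    """Return the run whose accuracy is closest to the median accuracy.

    `runs` is a list of dicts with an "accuracy" key (hypothetical layout).
    """
    med = median(r["accuracy"] for r in runs)
    # min() keeps one of the closest runs; ties are broken arbitrarily.
    return min(runs, key=lambda r: abs(r["accuracy"] - med))

# Ten made-up runs of one method on one data set.
accuracies = [0.71, 0.80, 0.65, 0.90, 0.79, 0.74, 0.83, 0.69, 0.88, 0.76]
runs = [{"seed": s, "accuracy": a} for s, a in enumerate(accuracies)]

rep = pick_representative(runs)
print(rep)
```

Only this representative model's `accuracy` and `simplicity` would then be shown to the expert.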


# Winners

|    | algorithm     |   trust_score |
|---:|:--------------|--------------:|
|  1 | uDSR          |          5.75 |
|  2 | QLattice      |          5.21 |
|  3 | geneticengine |          4.99 |
|  4 | operon        |          4.8  |
|  5 | Bingo         |          4.66 |
|  6 | pysr          |          4.17 |
|  7 | PS-Tree       |          3.15 |
|  8 | E2ET          |          2.72 |
|  9 | LassoCV       |          3.45 |

*Note: eql runs did not finish successfully, so eql is not rated.*

In [None]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from glob import glob
from pathlib import Path
import sympy as sp


In [None]:
rdir = '../results_stage2/'
datadir = '../experiment/data/stage2/data/'

In [None]:
frames = []
i = 0
for f in Path(rdir).rglob('*.json'):
#     print(f)
    if '7_7' in str(f):
        continue
    if '135' not in str(f):
        continue
    if 'gpzgd' in str(f):
        continue
    with open(f, 'r') as of:
        d = json.load(of)
    frames.append(d)
    i += 1
    
print('loaded',i,'results')
df = pd.DataFrame.from_records(frames)

# linear regression is actually lassocv 
df.loc[df['algorithm']=='LinearRegression','algorithm'] = 'LassoCV'
# fix cutoff dataset names
df['dataset'] = df['dataset'].apply(lambda x: x+'ata' if x.endswith('_d') else x)
########################################
# normalize simplicity
df['simplicity_original'] = (df['simplicity']-df['simplicity'].min())/(df['simplicity'].max()-df['simplicity'].min())
########################################
# time transform
df['time_hr'] = df['time_time']/3600
df['time_mins'] = df['time_time']/60
########################################
# deconstruct dataset names
df['dataset-full'] = df['dataset'].copy()
df['dataset'] = df['dataset'].apply(lambda x: '_'.join(x.split('_')[1:-1]))
df['task'] = df['dataset'].apply(lambda x: x.split('value_')[-1].split('_')[0])
df['horizon'] = df['dataset'].apply(lambda x: 7 if '7' in x else 14)
df['random_state'] = df['dataset-full'].apply(lambda x: x.split('_')[0])
df.head()

In [None]:
METRICS = [
   'mse_train', 
   'mae_train', 
   'r2_train',
   'mse_test', 
   'mae_test', 
   'r2_test', 
   'accuracy', 
   'feature_absence_score' 
]

In [None]:
df.algorithm.unique()

In [None]:
df['task'].unique()

# check run completion

In [None]:
df.groupby(['dataset','algorithm'])['random_state'].count().unstack()

In [None]:
order = df.groupby(['algorithm'])['r2_test'].mean().sort_values(ascending=False).index
df.groupby(['task','horizon','algorithm'])['r2_test'].median().unstack().round(3)[order]

In [None]:
df.groupby(['algorithm'])['r2_test'].median().sort_values(ascending=False).round(3)

In [None]:
order = df.groupby(['algorithm'])['simplicity'].mean().sort_values(ascending=False).index
df.groupby(['task','horizon','algorithm'])['simplicity'].mean().unstack().round(3)[order]

In [None]:
# df.groupby(['algorithm'])['simplicity'].mean().sort_values(ascending=False).round(3)

In [None]:
# model selection: for each (task, horizon, algorithm), keep the run
# whose r2_test is closest to the median r2_test over all runs
median_scores = df.groupby(['task','horizon','algorithm'])['r2_test'].median().unstack()
frames = []
for idx, row in median_scores.iterrows():
#     task = row['task'], row['horizon'], row['algorithm'], 
    for alg in row.index:
        if np.isnan(row[alg]): continue
        dfg = df.loc[
            (df.task==idx[0])
            & (df.horizon==idx[1])
            & (df.algorithm==alg)
        ]
        entry = dfg.loc[(dfg.r2_test-row.loc[alg]).abs().idxmin()]
        assert isinstance(entry, pd.Series)
        frames.append(entry.to_dict())
df_best = pd.DataFrame.from_records(frames)

In [None]:
df_best.loc[df_best.algorithm=='QLattice','symbolic_model']

# evaluate models

Note: this was redone from the raw json results due to aggressive rounding in the initial runs. The raw count inputs are quite large, so a larger cutoff was used when rounding floating point numbers in the code below. 
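
To illustrate why the rounding cutoff matters, here is a minimal sketch. The helper `round_floats_in_expr` is hypothetical and is not the `round_floats` function imported from `evaluation` below, whose implementation is not shown here. It rounds every decimal literal in an expression string to a given number of digits. When a small coefficient multiplies a large raw count, rounding too aggressively removes the term altogether.

```python
import re

# Matches decimal literals such as 0.98 or 1.2e-05, but not digits that
# are part of an identifier like x_1 (hypothetical pattern for this sketch).
_FLOAT = re.compile(r"(?<![\w.])\d+\.\d+(?:[eE][+-]?\d+)?")

def round_floats_in_expr(expr: str, ndigits: int = 6) -> str:
    """Round every float literal in `expr` to `ndigits` decimal places."""
    def _round(match):
        return repr(round(float(match.group(0)), ndigits))
    return _FLOAT.sub(_round, expr)

expr = "0.0000123*cases + 0.98"
coarse = round_floats_in_expr(expr, 3)  # the small coefficient collapses to 0.0
fine = round_floats_in_expr(expr, 6)    # the small coefficient survives
print(coarse)
print(fine)
```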

In [None]:
from evaluation import get_symbolic_model, simplicity, round_floats

def redo_model(x):
    seed = x['random_state'] #.values[0]
    ds = x['dataset'] #.values[0]
    task = x['task'] #.values[0]
    dataset = pd.read_csv(f'../experiment/data/stage2/data/{seed}_{ds}_train.csv')
#     '../experiment/data/stage2/data/'
#     print(f'../experiment/data/stage2/data/{seed}_{ds}_{task}_train.csv')
    feature_names = [k for k in dataset.columns if k!= task]
    local_dict = {k:sp.Symbol(k) for k in feature_names}
    if x['algorithm'] == 'QLattice':
        print('symbolic_model:',
              x.symbolic_model)
    mdl = str(get_symbolic_model(x.symbolic_model, local_dict=local_dict, simplify=False))
    simp = simplicity(mdl,feature_names,simplify=False)
    if x['algorithm'] == 'QLattice':
        print('get_symbolic_model:',
              mdl)
    x['simplified_model'] = round_floats(mdl, 6)
    x['simplicity'] = simp
#     if x['algorithm'] == 'QLattice': pdb.set_trace()
    return x
   
# apply redo_model to each row; the returned rows carry the new 'simplified_model' column
df_best = df_best.apply(redo_model, axis=1)
# normalize simplicity
df_best['simplicity'] = (df_best['simplicity']-df_best['simplicity'].min())/(df_best['simplicity'].max()-df_best['simplicity'].min())

df_best[['task','simplified_model','simplicity','algorithm']]

# add expert scores

Expert scores were determined by a subject-matter expert on COVID-19 after reviewing each model.
These scores can be found here: https://docs.google.com/presentation/d/1zVb4HqImP4nB_alrkoQmui16mC8SOxlzZ2uHP3vTltg/edit?usp=sharing

In [None]:
expert_scores = {
    'cases': {
        'Bingo': 3,
        'E2ET': 2,
        'LassoCV': 1,
        'PS-Tree': 3,
        'QLattice': 4,
        'geneticengine': 3,
        'operon': 5,
        'pysr': 3,
        'uDSR': 4
    },
    'hosp': {
        'Bingo': 4,
        'E2ET': 2,
        'LassoCV': 1,
        'PS-Tree': 3,
        'QLattice': 5,
        'geneticengine': 5,
        'operon': 4,
        'pysr': 4,
        'uDSR': 5
    },
    'deaths': {
        'Bingo': 4,
        'E2ET': 2,
        'LassoCV': 1,
        'PS-Tree': 3,
        'QLattice': 5,
        'geneticengine': 4,
        'operon': 4,
        'pysr': 3,
        'uDSR': 4
    }
}
frames = []
for k,v in expert_scores.items():
    for alg,score in v.items():
        data = {}
        data['task'] = k
        data['algorithm'] = alg
        data['expert_score'] = score
        frames.append(data)
         
expertdf = pd.DataFrame.from_records(frames)
# display(expertdf)
df_final = df_best.merge(expertdf, on=['task','algorithm'])
len(df_final)

In [None]:
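# select the numeric column explicitly: averaging the string-valued `task`
# column raises a TypeError in pandas >= 2.0
expertdf.groupby('algorithm')[['expert_score']].mean().round(3).sort_values(by='expert_score')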

In [None]:
dfm = df_final[['algorithm','task','r2_test','simplicity','expert_score']].melt(id_vars = ['algorithm','task'])
print('r2_test')
display(
    dfm.loc[dfm.variable=='r2_test'].groupby(['algorithm','task'])['value'].median().round(2).unstack()
)
print('simplicity')
display(
    dfm.loc[dfm.variable=='simplicity'].groupby(['algorithm','task'])['value'].median().round(2).unstack()
)
print('expert_score')
display(
    dfm.loc[dfm.variable=='expert_score'].groupby(['algorithm','task'])['value'].median().round(2).unstack()
)

# calculate metric ranks

Calculate the trust score, defined as the harmonic mean of the per-task ranks of r2, simplicity, and expert score.
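
A small worked example may help show why a harmonic mean is used instead of an arithmetic mean. The two methods and their ranks below are made up for illustration (higher rank is better, as in the cell that follows). Both have the same arithmetic mean, but the harmonic mean penalizes the method that ranks poorly on one metric.

```python
from statistics import harmonic_mean, mean

# Hypothetical per-metric ranks for two made-up methods on one task.
ranks = {
    "method_a": {"r2_test_rank": 8, "simplicity_rank": 2, "expert_score_rank": 8},
    "method_b": {"r2_test_rank": 6, "simplicity_rank": 6, "expert_score_rank": 6},
}

# Harmonic mean: 3 / (1/r2_rank + 1/simplicity_rank + 1/expert_rank)
trust = {name: harmonic_mean(list(r.values())) for name, r in ranks.items()}
arith = {name: mean(list(r.values())) for name, r in ranks.items()}

for name in ranks:
    print(name, "harmonic:", round(trust[name], 2), "arithmetic:", arith[name])
```

Here `method_a` scores 3 / (1/8 + 1/2 + 1/8) = 4, while the balanced `method_b` scores 6, so a method has to do reasonably well on all three metrics to rank highly.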

In [None]:
mets = ['r2_test','simplicity','expert_score']
rank_mets = []
df_sum = (df_final.groupby(['algorithm','task'],as_index=False)
          [mets]
#           .mean()
          .max()
         )
n_algs = df_final.algorithm.nunique()
df_sum['r2_test'] = df_sum['r2_test'].round(3)
# df_sum['simplicity'] = df_sum['simplicity'].round(3)
for col in mets:
    ascending=False
    colname = col+'_rank' 
    df_sum[colname]=(n_algs-df_sum
                     .groupby('task')
                     [col]
                     .rank(ascending=ascending, 
                           method='dense')
                    ) 
    assert df_sum[colname].max() <= df_sum.algorithm.nunique()
    rank_mets.append(colname)

# compute the harmonic mean of the three rank columns
from scipy import stats
trust_score = (df_sum.groupby(['task','algorithm'])[rank_mets]
               .apply(lambda x: stats.hmean(x.values[0]))
#                .apply(lambda x: np.mean(x.values[0]))
               .rename('trust_score') 
              )
df_sum = df_sum.merge(trust_score, on=['task','algorithm'])

# display winners

In [None]:
winners = df_sum.groupby(['algorithm'])['trust_score'].mean().sort_values(ascending=False)
order = winners.index 
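# `winners` already holds the mean trust score per algorithm in descending order;
# reusing it avoids averaging the string-valued `task` column, which fails in pandas >= 2.0
print(winners.round(2).reset_index().to_markdown())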

In [None]:
mets = rank_mets + ['trust_score']

dfm = df_sum[['algorithm','task']+mets].melt(id_vars = ['algorithm','task'])
dfm.groupby(['task','variable','algorithm'])['value'].max().unstack().round(2)[order]