
How to produce predictions? #59

Closed

metin-akyol opened this issue Apr 22, 2023 · 9 comments

metin-akyol commented Apr 22, 2023

For my use case, I would like to obtain, for each qid, the highest- and lowest-ranked observations, identified by the uniqueID.

I have created a minimal reproducible example that purposefully has a perfect relationship between the feature used for prediction and the corresponding label, so we can test whether the algorithm works correctly (indeed, for large enough data I do get ndcg=1.0, so it appears to work correctly).

I have not been able to merge my predicted ranks back to the original dataset in the correct order. The slates_y is not in an order that matches my test_df. Is there a way to match the slates_y tensor back to test_df in the correct order, i.e. so that each row matches the correct uniqueID in test_df?

For illustrative purposes, I use a small dataset:

import numpy as np
import pandas as pd

num_qid = 10
num_obs_per_qid = 10
numRows = num_qid * num_obs_per_qid
num_ranks = 5


df = pd.DataFrame({
    "qid":[i for i in range(num_qid) for j in range(num_obs_per_qid)],
    "uniqueID":num_qid*list(range(num_obs_per_qid)),
    "feature":np.random.random(size=(numRows,))
})

#df['label'] = pd.qcut(df["feature"], q=5, labels=False, precision=0, duplicates='raise')
# transform keeps the original row index, so the per-qid labels align with df
df['label'] = df.groupby("qid")["feature"].transform(lambda x: pd.qcut(x, q=num_ranks, labels=False, precision=0, duplicates='raise'))

# qid cutoffs for the split (note: '<=' puts qids 0-7, i.e. 8 of the 10 qids, in train)
train_rows = round(0.7 * num_qid)
vali_rows  = round(0.8 * num_qid)

train = df[df['qid'] <= train_rows]
vali  = df[(df['qid'] > train_rows) & (df['qid'] <= vali_rows)]
test  = df[(df['qid'] > vali_rows)]
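As a quick sanity check on this setup (a sketch relying only on the columns defined above), the label should be monotone in the feature within every qid:

# within each qid, sorting by feature must leave the labels non-decreasing
assert all(
    df.sort_values('feature').groupby('qid')['label']
      .apply(lambda s: s.is_monotonic_increasing)
)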

I use the code below to produce predictions, but I cannot make sense of the order of slates_y, so I am unable to merge it back to test_df in the correct order.

from sklearn.datasets import dump_svmlight_file

def df_to_libsvm(df: pd.DataFrame, folderName, fileName):
    x = df[['feature']]
    y = df['label']
    query_id = df['qid']
    dump_svmlight_file(X=x, y=y, query_id=query_id, f=f'{folderName}/{fileName}.txt', zero_based=True)


df_to_libsvm(train, 'train_data', 'train')
df_to_libsvm(vali, 'train_data', 'vali')
df_to_libsvm(test, 'test_data', 'test')
# the loader also expects a vali file in the input path, so dump test there under that name too
df_to_libsvm(test, 'test_data', 'vali')
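For reference, the four dump calls above produce the layout the loader reads later (assuming load_libsvm_dataset looks for <role>.txt files under input_path):

train_data/train.txt
train_data/vali.txt
test_data/test.txt
test_data/vali.txt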


# imports as in allRank's main.py (module paths may differ slightly across allRank versions)
import os
from argparse import ArgumentParser
from functools import partial
from pprint import pformat

import numpy as np
import torch
import torch.optim as optim
from attr import asdict

import allrank.models.losses as losses
from allrank.config import Config
from allrank.data.dataset_loading import create_data_loaders, load_libsvm_dataset
from allrank.models.model import make_model
from allrank.models.model_utils import CustomDataParallel, get_torch_device
from allrank.training.train_utils import fit
from allrank.utils.command_executor import execute_command
from allrank.utils.file_utils import PathsContainer, create_output_dirs
from allrank.utils.ltr_logging import init_logger
from allrank.utils.python_utils import dummy_context_mgr
from allrank.inference.inference_utils import __rank_slates

parser = ArgumentParser("allRank")

parser.add_argument("--job-dir", help="Base output path for all experiments", required=False, default = "test_run")

parser.add_argument("--run-id", help="Name of this run to be recorded (must be unique within output dir)", required=False, default = "test_run")

parser.add_argument("--config-file-name", type=str, help="Name of json file with config", required=False, default = "../scripts//local_config.json")

# pass args=[] so parse_args also works inside a notebook
args = parser.parse_args(args=[])
paths = PathsContainer.from_args(args.job_dir, args.run_id, args.config_file_name)
# reproducibility
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
np.random.seed(42)
create_output_dirs(paths.output_dir)
logger = init_logger(paths.output_dir)
# logger.info(f"created paths container {paths}")

# read config
config = Config.from_json(paths.config_path)
logger.info("Config:\n {}".format(pformat(vars(config), width=1)))

output_config_path = os.path.join(paths.output_dir, "used_config.json")

# note: 'cp' is a Unix/Linux command; on Windows replace it with 'copy'
execute_command("cp {} {}".format(paths.config_path, output_config_path))

# have to prefix '..' to this path; it can't be set directly in local_config.json
config.data.path = '../allrank/train_data'
# train_ds, val_ds
train_ds, val_ds = load_libsvm_dataset(
    input_path=config.data.path,
    slate_length=config.data.slate_length,
    validation_ds_role=config.data.validation_ds_role,
)

n_features = train_ds.shape[-1]
assert n_features == val_ds.shape[-1], "Last dimensions of train_ds and val_ds do not match!"

# train_dl, val_dl
train_dl, val_dl = create_data_loaders(
    train_ds, val_ds, num_workers=config.data.num_workers, batch_size=config.data.batch_size)

# gpu support
dev = get_torch_device()
logger.info("Model training will execute on {}".format(dev.type))

# instantiate model
model = make_model(n_features=n_features, **asdict(config.model, recurse=False))
if torch.cuda.device_count() > 1:
    model = CustomDataParallel(model)
    logger.info("Model training will be distributed to {} GPUs.".format(torch.cuda.device_count()))
model.to(dev)

# load optimizer, loss and LR scheduler
optimizer = getattr(optim, config.optimizer.name)(params=model.parameters(), **config.optimizer.args)
loss_func = partial(getattr(losses, config.loss.name), **config.loss.args)
if config.lr_scheduler.name:
    scheduler = getattr(optim.lr_scheduler, config.lr_scheduler.name)(optimizer, **config.lr_scheduler.args)
else:
    scheduler = None

with torch.autograd.detect_anomaly() if config.detect_anomaly else dummy_context_mgr():  # type: ignore
    # run training
    result = fit(
        model=model,
        loss_func=loss_func,
        optimizer=optimizer,
        scheduler=scheduler,
        train_dl=train_dl,
        valid_dl=val_dl,
        config=config,
        device=dev,
        output_dir=paths.output_dir,
        tensorboard_output_path=paths.tensorboard_output_path,
        **asdict(config.training)
    )
# again prefix '..'; now point at the test data
config.data.path = '../allrank/test_data'
# test_ds, val_ds
test_ds, val_ds = load_libsvm_dataset(
    input_path=config.data.path,
    slate_length=config.data.slate_length,
    validation_ds_role=config.data.validation_ds_role,
    name_of_file="test"
)
test_dl, val_dl = create_data_loaders(test_ds, val_ds, num_workers=config.data.num_workers, batch_size=config.data.batch_size)

slates_X, slates_y = __rank_slates(test_dl, model)

metin-akyol changed the title from "How to merge predictions with test dataframe?" to "How to produce predictions?" on May 15, 2023
metin-akyol commented May 15, 2023

In particular, if we stick to the example above with

num_qid = 10
num_obs_per_qid = 10

Then slates_X looks like this:

tensor([[[0.1366],
         [0.2562],
         [0.2953],
         [0.2965],
         [0.3226],
         [0.4198],
         [0.5528],
         [0.6115],
         [0.7089],
         [0.8487]]])

and slates_y looks like this:
tensor([[0., 0., 1., 1., 2., 2., 3., 3., 4., 4.]])

So it appears the tensors are sorted (but for datasets with more features this does not seem to be the case, or at least it is not clear by which variable they are sorted).

My test dataframe looks like this:


	qid	uniqueID	feature	label
90	9	0	0.295291	1
91	9	1	0.322551	2
92	9	2	0.848670	4
93	9	3	0.136621	0
94	9	4	0.708911	4
95	9	5	0.552820	3
96	9	6	0.296510	1
97	9	7	0.419781	2
98	9	8	0.256207	0
99	9	9	0.611514	3

But it is not clear to me how I can assign slates_y back to the test df in the correct order.

Because when I do this, I clearly get them in an incorrect order:

y_pred = pd.DataFrame(slates_y.numpy())
y_pred_long = y_pred.stack().reset_index()
test['y_pred'] = y_pred_long[0].values

qid	uniqueID	feature	label	y_pred
90	9	0	0.493796	2	0.0
91	9	1	0.522733	3	0.0
92	9	2	0.427541	2	1.0
93	9	3	0.025419	0	1.0
94	9	4	0.107891	1	2.0
95	9	5	0.031429	0	2.0
96	9	6	0.636410	4	3.0
97	9	7	0.314356	1	3.0
98	9	8	0.508571	3	4.0
99	9	9	0.907566	4	4.0

In this specific case, it appears that simply sorting the test df before adding y_pred would do the trick, but that does not seem to work for cases with more features, where the relationship is not as strong.

metin-akyol commented May 15, 2023

If slates_X and slates_y are in the same order, I could technically merge them together and then merge the result with the original test dataframe using slates_X as the merge key, since those values should appear in both. But this only works if all the features, i.e. X and y, are re-ordered in the same way, which I am not sure of.

The reason I need to merge these back to the test dataframe is that, for my use case, I need to know for each qid what the highest- and lowest-ranked observations are, so somehow I need to get back to having qid and uniqueID.

metin-akyol commented May 15, 2023

The only solution I can think of is to predict one qid at a time, concatenate slates_X with slates_y (assuming these are in the same order?) into one dataframe, assign a column indicating the qid, and then concatenate all the qids.

Then merge this dataframe with the test dataframe using any of the X features (since those should let me identify the same rows in each dataframe).

metin-akyol commented

After testing some more, I noticed that slates_X is reordered consistently across all features. So if I use a dataset with 5 features, I could simply concatenate slates_X and slates_y back together, and then merge this dataframe with the test dataframe using all 5 features as merge keys, since they will match across the two dataframes and should identify rows uniquely (the chances of any two rows sharing 5 identical feature values are very slim). So something like this (pseudo code; a fuller sketch follows below):

tmp_df = pd.concat([slates_y_df, slates_X_df], axis=1)
merged_df = pd.merge(test, tmp_df, on=['feature0', 'feature1', 'feature2', 'feature3', 'feature4'])
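Fleshed out, that idea might look like the sketch below. It assumes slates_X has shape (n_slates, slate_length, n_features), slates_y has shape (n_slates, slate_length), no slate is padded, and the test dataframe's feature columns are named feature0, feature1, ... (the column names and the how='left' choice are illustrative assumptions, not anything allRank guarantees):

import pandas as pd

n_features = slates_X.shape[-1]
feat_cols = [f'feature{i}' for i in range(n_features)]

# flatten (n_slates, slate_length, n_features) into long format: one row per item,
# rows within a slate appearing in the model's ranked order (best first)
tmp_df = pd.DataFrame(slates_X.numpy().reshape(-1, n_features), columns=feat_cols)
tmp_df['slates_y'] = slates_y.numpy().reshape(-1)

# the feature values act as a (near-)unique row fingerprint for the merge
merged_df = pd.merge(test, tmp_df, on=feat_cols, how='left')

One caveat: merging on exact float equality is fragile; rounding both sides first (e.g. test[feat_cols].round(12) and likewise for tmp_df) makes the join more robust.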

niccola-tartaglia commented May 16, 2023

Your approach seems to make sense, but I believe the function __rank_slates does not do what you think it does. If you look at the function definition, it does not return the predicted ranks but rather the original y vector, just reordered. If you order it back via your merge approach, you will probably end up with a perfect prediction score.

You should check whether there is a different function that returns the predicted rank or score. I am also curious about this and have posted a separate question, as it is a related but different issue.

metin-akyol commented May 16, 2023

Thank you for your response, I believe you are correct. After replacing the label in the test data with a random variable, I get a slates_y that matches that random variable, and the dataset is ordered in the same order as this random variable. It appears that the model looks at the label in the test dataset, which seems strange to me, since that is what it is supposed to predict.

My point here is, if I don't have any values for y, how is the model going to predict y?

In my example above, I provide the feature as an input which has a perfect relationship with the label that the model should predict. But when I provide a random variable for y during testing, the model fails to predict correctly.

It is not clear to me how this works out of sample, when I don't have the correct y for my test data.

Would it make a difference if I used rank_slates instead of __rank_slates?

niccola-tartaglia commented
Looking at this again, it appears to me that __rank_slates orders the data according to the model's predicted score. Below is the relevant section from __rank_slates.

The model.score function takes the true y vector as an input for some reason (input_indices), but I believe that is only used to generate a vector of the same length as y_true (according to the description of ones_like in the torch docs).

    with torch.no_grad():
        for xb, yb, _ in dataloader:
            X = xb.type(torch.float32).to(device=device)
            y_true = yb.to(device=device)

            input_indices = torch.ones_like(y_true).type(torch.long)  # <--
            mask = (y_true == losses.PADDED_Y_VALUE)
            scores = model.score(X, mask, input_indices)  # <--

            scores[mask] = float('-inf')

            _, indices = scores.sort(descending=True, dim=-1)  # <--
            indices_X = torch.unsqueeze(indices, -1).repeat_interleave(X.shape[-1], -1)
            reranked_X.append(torch.gather(X, dim=1, index=indices_X).cpu())
            reranked_y.append(torch.gather(y_true, dim=1, index=indices).cpu())
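To make the reordering concrete, here is a toy run of the same sort-and-gather logic with made-up numbers (a sketch, independent of allRank):

import torch

scores = torch.tensor([[0.2, 0.9, 0.1, 0.5]])  # model scores for one slate of 4 items
y_true = torch.tensor([[1., 3., 0., 2.]])      # true labels in the original item order
X = torch.rand(1, 4, 2)                        # 4 items with 2 features each

_, indices = scores.sort(descending=True, dim=-1)  # indices == [[1, 3, 0, 2]]
indices_X = torch.unsqueeze(indices, -1).repeat_interleave(X.shape[-1], -1)
reranked_X = torch.gather(X, dim=1, index=indices_X)
reranked_y = torch.gather(y_true, dim=1, index=indices)

print(reranked_y)  # tensor([[3., 2., 1., 0.]]): the true labels, best-scored item first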

If I understand this correctly, you should be able to infer the ranking from the order of slates_y. It contains the same values as your true y vector, just reordered by the model's predicted scores (highest-scored item first).

niccola-tartaglia commented
So I would recommend some reverse engineering: create an ordered position index from your slates before you merge them back to test, and then use that index as your predicted rank.

metin-akyol commented
Thank you Niccola, this worked. I reset the index of the dataframe version of slates_X, and this index represents the ranked items within each qid (so it runs from 1 to numObsPerQid); I then applied qcut to convert it to the number of ranks I need for my purpose. I highly appreciate the help!
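A sketch of that fix, under the same assumptions as the earlier snippet (shapes as returned by __rank_slates, no padding, feature columns named feature0, feature1, ...; pred_rank and pred_grade are illustrative names, not allRank API):

import numpy as np
import pandas as pd

n_slates, slate_len = slates_y.shape
n_features = slates_X.shape[-1]
feat_cols = [f'feature{i}' for i in range(n_features)]

ranked = pd.DataFrame(slates_X.numpy().reshape(-1, n_features), columns=feat_cols)
# position within each slate is the predicted rank: 0 = item the model scored highest
ranked['pred_rank'] = np.tile(np.arange(slate_len), n_slates)
# bucket the positions into num_ranks grades; invert if your labels use "higher = better"
ranked['pred_grade'] = pd.qcut(ranked['pred_rank'], q=num_ranks, labels=False)

result = pd.merge(test, ranked, on=feat_cols)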
