How to produce predictions? #59
In particular, if we stick to the example above with
Then slates_X looks like this:
and slates_y looks like this:

So it appears the tensors are sorted (though for datasets with more features this does not seem to be the case, or it is not clear by which variable they are sorted). My test dataframe looks like this:
But it is not clear to me how I can assign slates_y back to the test df in the correct order. When I do this, I clearly get them in an incorrect order:
In this specific case, it appears that simply sorting prior to adding y_pred would do the trick, but that doesn't seem to work for cases with more features, where the relationship is not as strong.
If slates_X and slates_y are in the same order, I could technically merge them together, and then merge the result with the original test dataframe using slates_X as the merge key, since those values should appear in both. But this only works if all the features, i.e. X and y, are re-ordered in the same way, which I am not sure of. The reason I need to merge these back to the test dataframe is that for my use case, I need to know, for each qid, which observations are ranked highest and lowest; so somehow I need to get back to having qid and uniqueID.
The only solution I can think of is to predict one qid at a time, concatenate slates_X with slates_y (assuming these are in the same order???) into one dataframe, assign a column indicating the qid, and then concatenate all the qids. Then merge this dataframe with the test dataframe using any of the X features (since those should allow me to identify the same rows in each dataframe).
After testing some more, I noticed that slates_X is ordered the same way for all the features. So if I use a dataset with 5 features, I could simply concatenate slates_X and slates_y back together, and then merge this dataframe with the test dataframe using all 5 features as identifiers, since they will match in each dataframe. That should allow me to uniquely match rows (the chances of any row having identical values for all 5 features are very slim). So something like this (pseudocode): tmp_df = pd.concat([slates_X, slates_y], axis=1)
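A minimal sketch of that merge idea, assuming slates_X and slates_y share the same row order after being converted to dataframes. The feature names (f1..f5), the toy values, and the column name y_pred_order are all illustrative assumptions, not allRank's API:

```python
import numpy as np
import pandas as pd

# Hypothetical feature names; replace with your actual columns.
feature_cols = [f"f{i}" for i in range(1, 6)]

# Toy stand-ins for the flattened slates_X / slates_y tensors,
# assumed to be in the same (predicted) row order.
slates_X_df = pd.DataFrame(np.arange(15).reshape(3, 5), columns=feature_cols)
slates_y_df = pd.DataFrame({"y_pred_order": [2.0, 1.0, 0.0]})

# Column-wise concatenation; note pd.concat takes a *list* of objects,
# so pd.concat(slates_y, slates_X) would raise an error.
tmp_df = pd.concat([slates_X_df, slates_y_df], axis=1)

# Toy test_df with the same feature values plus the identifiers we
# want to recover (unique_ID, qid).
test_df = pd.DataFrame(np.arange(15).reshape(3, 5), columns=feature_cols)
test_df["unique_ID"] = ["a", "b", "c"]
test_df["qid"] = 0

# Merge on all feature columns to attach predictions to identifiers.
merged = test_df.merge(tmp_df, on=feature_cols, how="left")
```

This relies on the feature vectors being (near-)unique per row; with duplicated feature rows the merge would multiply rows.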
Your approach seems to make sense, but I believe the function __rank_slates does not do what you think it does. If you look at the function definition, it is not returning the predicted ranks but rather the original y vector, just reordered. If you order it back via your merge approach, you will probably end up with a perfect prediction score. You should check whether there is a different function that returns the predicted rank or score. I am also curious about this; I posted a separate question about it, as it is a related but different issue.
Thank you for your response, I believe you are correct. After replacing the label in the test data with a random variable, I get slates_y that matches that random variable, and the dataset is ordered in the same order as this random variable. It appears that the model looks at the label in the test dataset, which seems strange to me, since that is what it is supposed to predict. My point is: if I don't have any values for y, how is the model going to predict y? In my example above, I provide a feature as input which has a perfect relationship with the label that the model should predict. But when I provide a random variable for y during testing, the model fails to predict correctly. It is still not clear to me how this works out of sample, when I don't have the correct y for my test data. Would it make a difference if I used rank_slates instead of __rank_slates?
Looking at this again, it appears to me that __rank_slates is ordering the data according to the model's predicted score; below is the relevant section from __rank_slates. The model.score function takes the true y vector as an input (input_indices), but I believe that is only to generate a vector of the same length as y_true (per the ones_like description in the torch library).
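A toy illustration of the reordering step described above: the model's scores (hard-coded here) define a permutation, and both X and y_true are taken in that order. This mirrors the described behaviour of __rank_slates in spirit only; the variable names and shapes are assumptions, not allRank's actual code:

```python
import numpy as np

# One slate of 3 documents: true labels, pretend model scores,
# and a feature matrix (3 docs x 2 features).
y_true = np.array([0.0, 2.0, 1.0])
scores = np.array([0.1, 0.9, 0.5])           # stand-in for model.score output
X = np.arange(6, dtype=float).reshape(3, 2)

# Sort documents by descending predicted score.
order = np.argsort(-scores)

# Both outputs are the originals, merely permuted by the score order —
# the "slates_y" analogue is the true y vector, not a prediction.
reranked_y = y_true[order]
reranked_X = X[order]
```

This is why comparing reranked_y against the true labels after merging back yields a seemingly perfect score: the values themselves never changed, only their order did.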
If I understand this correctly, you should be able to infer the ranking from the order of slates_y. The slates_y values should be the same as your true y vector, just reordered to match the model's predicted ranking (the same order as slates_X).
So I would recommend some type of reverse engineering, where you create an ordered index from your slates_y before you merge it back to test and then use that index as your predicted rank. |
Thank you Niccala, this worked. I reset the index of the dataframe version of slates_X, and this index represents the ranked items for each qid (so it runs from 1 to numObsPerQid); I then applied qcut to convert this to the number of ranks I need for my purpose. I highly appreciate the help!!!
For my use case, I would like to obtain for each qid, the highest and lowest ranked observations, identified by the unique_ID.
I have created a minimal reproducible example that purposefully has a perfect relationship between the predictive feature and the corresponding label, so we can test whether the algorithm works correctly (indeed, for large enough data I get ndcg=1.0, so it appears to work correctly).
I have not been able to merge my predicted ranks back to the original dataset in the correct order. The slates_y tensor is not in an order that matches my test_df. Is there any way I can match slates_y back to test_df in the correct order, i.e. so that each row matches the correct unique_ID in test_df?
For illustrative purposes, I use a small dataset:
I use the code below to produce predictions, but cannot make sense of the order of slates_y, so I am unable to merge it back to test_df in a correct order.