text2text model evaluation not working #41

Closed
Khalid-Usman opened this issue Jul 27, 2021 · 13 comments
Labels
bug Something isn't working

Comments

@Khalid-Usman

Description

Model evaluation does not work; it fails before outputting precision and recall.

How to Reproduce?

I ran the following command:

python3 -m pecos.apps.text2text.evaluate --pred-path ./test-prediction.txt --truth-path ./test.txt --text-item-path ./output-labels.txt

where:
--pred-path is the path of the file produced during model prediction,
--truth-path is the path of the test file, e.g. "Out1, Out2, Out3 \t cheap door", where Out1, Out2 and Out3 are line numbers in the label file (an illustration follows below),
--text-item-path is the label file, here ./output-labels.txt.
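For illustration only, a hypothetical layout of these files (the exact ID convention, e.g. 0-based vs. 1-based line numbers, is an assumption and should be checked against the PECOS text2text docs):

./output-labels.txt (one label string per line; the line number is the label ID):
    door hinge
    cheap wooden door
    door handle

./test.txt (comma-separated label IDs, a tab, then the input text):
    0,1,2 \t cheap door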

What have you tried to solve it?

Error message or code output

Traceback (most recent call last):
  File "/home/khalid/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/khalid/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/khalid/PECOS/pecos_venv/lib/python3.7/site-packages/pecos/apps/text2text/evaluate.py", line 130, in <module>
    do_evaluation(args)
  File "/home/khalid/PECOS/pecos_venv/lib/python3.7/site-packages/pecos/apps/text2text/evaluate.py", line 119, in do_evaluation
    Y_true = smat.csr_matrix((val_t, (row_id_t, col_id_t)), shape=(num_samples_t, len(item_dict)))
  File "/home/khalid/PECOS/pecos_venv/lib/python3.7/site-packages/scipy/sparse/compressed.py", line 55, in __init__
    dtype=dtype))
  File "/home/khalid/PECOS/pecos_venv/lib/python3.7/site-packages/scipy/sparse/coo.py", line 196, in __init__
    self._check()
  File "/home/khalid/PECOS/pecos_venv/lib/python3.7/site-packages/scipy/sparse/coo.py", line 285, in _check
    raise ValueError('column index exceeds matrix dimensions')
ValueError: column index exceeds matrix dimensions

Environment

  • Operating system:
  • Python version:
  • PECOS version:

(Add as much information about your environment as possible, e.g. dependencies versions.)

Khalid-Usman added the bug label on Jul 27, 2021
@OctoberChang
Contributor

@Khalid-Usman, from the error message it seems that a label_id in your ./test.txt is out of the range defined by your ./output-labels.txt.

Can you verify that the label_ids in ./test.txt are within the range of ./output-labels.txt?
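A minimal sketch of such a check (a hypothetical helper, not part of PECOS; it assumes the label IDs in ./test.txt are comma-separated 0-based integers before the tab that index lines of ./output-labels.txt, so adjust the parsing if your format differs):

# check_label_range.py -- hypothetical helper, not part of PECOS
with open("./output-labels.txt", encoding="utf-8") as f:
    num_labels = sum(1 for _ in f)

bad_lines = []
with open("./test.txt", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        label_part = line.split("\t", 1)[0]
        ids = [int(x) for x in label_part.split(",") if x.strip()]
        if any(i < 0 or i >= num_labels for i in ids):
            bad_lines.append(line_no)

print(num_labels, "labels;", len(bad_lines), "test lines with out-of-range IDs")
print("first offending lines:", bad_lines[:10])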

@Khalid-Usman
Author

I don't think so, but let me double check...

Also, instead of this I used the following command and got precision/recall. Is that correct?

python3 -m pecos.apps.text2text.evaluate --pred-path ./test-prediction.txt --truth-path ./test.txt

@OctoberChang
Contributor

If --text-item-path is not provided, the evaluation script assumes your ground truth file ./test.txt has format 1 (https://github.com/amzn/pecos/blob/mainline/pecos/apps/text2text/evaluate.py#L43). If the label IDs in ./test.txt and ./train.txt are aligned, then it should be fine.

@Khalid-Usman
Author

@OctoberChang I verified: there were a few label_ids in ./test.txt that are greater than the maximum index value of the ./output-labels.txt file. So I removed those label_ids from the ./test.txt file, and I am still getting the same error.

@Khalid-Usman
Author

Khalid-Usman commented Jul 28, 2021

@OctoberChang there is something wrong in the evaluation code.

I tried to debug, and for the ground-truth items the error "column index exceeds matrix dimensions" is raised at the following line:

Y_true = smat.csr_matrix((val_t, (row_id_t, col_id_t)), shape=(num_samples_t, len(item_dict)))

I printed each variable and got,

len(item_dict) = 2580153
num_samples_t = 132715
len(val_t) = 273316
len(row_id_t) = 273316
len(col_id_t) = 273316

Moreover,

num_samples_t = num_samples_p

@OctoberChang
Contributor

@Khalid-Usman, you should print max(col_id_t) and check whether max(col_id_t) < len(item_dict). If not, then some label_id is still out of scope of the pre-defined label set specified in your ./output-labels.txt.
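For reference, a minimal sketch showing when SciPy raises this exact error: the CSR constructor fails whenever some column index is not strictly smaller than the declared number of columns (here num_cols stands in for len(item_dict)):

import numpy as np
import scipy.sparse as smat

num_cols = 5                        # stand-in for len(item_dict)
row_id = np.array([0, 0, 1])
col_id = np.array([1, 4, 5])        # 5 >= num_cols, i.e. out of range
val = np.array([1.0, 1.0, 1.0])

try:
    Y = smat.csr_matrix((val, (row_id, col_id)), shape=(2, num_cols))
except ValueError as e:
    print(e)                        # column index exceeds matrix dimensions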

@Khalid-Usman
Author

@OctoberChang, yes, max(col_id_t) < len(item_dict), but I found the max index in the ./output-labels.txt file and removed all rows from ./test.txt containing indices greater than that max index.

So I don't think there exists any label_id that has no corresponding line number in ./output-labels.txt.

Thanks, please verify that.

@OctoberChang
Contributor

@Khalid-Usman, can you try evaluating just the first line of your prediction ./test-prediction.txt against the first line of the ground truth ./test.txt, given the pre-defined label set ./output-labels.txt?

If this still doesn't work, you can share the first line of those two files, as well as output-labels.txt, with me and I can take a look.

@Khalid-Usman
Author

Khalid-Usman commented Jul 29, 2021

@OctoberChang what is the score in ./test-prediction.txt? E.g., for my query I got 20 related documents sorted in descending order by score. Are these 20 retrieved documents actually the precision for top-k (top-20)? Or does this score have nothing to do with precision, so we have to calculate it ourselves by matching these documents with the ground truth?

@OctoberChang
Contributor

The prediction score matrix (i.e., Y_pred) from text2text models is not related to the ground truth labels.
In other words, to get precision@k (a function of Y_true and Y_pred), you also need to match Y_pred with Y_true.
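A minimal sketch of that matching, assuming Y_true and Y_pred are scipy.sparse CSR matrices of shape (num_samples, num_labels) with prediction scores stored in Y_pred; this is a generic precision@k computation, not the exact PECOS implementation:

import numpy as np

def precision_at_k(Y_true, Y_pred, k=10):
    # fraction of the top-k predicted labels per row that also appear
    # in the corresponding ground-truth row, averaged over rows
    hits = 0
    for i in range(Y_pred.shape[0]):
        row = Y_pred.getrow(i)
        order = np.argsort(-row.data)[:k]   # positions of the k highest scores
        topk = row.indices[order]           # their label indices
        true_set = set(Y_true.getrow(i).indices)
        hits += sum(1 for j in topk if j in true_set)
    return hits / (k * Y_pred.shape[0])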

@simonhughes22
Copy link

simonhughes22 commented Jul 30, 2021

I found the recall numbers output by that function to be wildly different from what I was expecting, so I did the calculation myself and got much higher numbers. If a query has 4 labels and the top result is in the recall set, do you compute recall@1 as 1.0 or 0.25? It could also be that I am having a similar label misalignment issue.

@OctoberChang
Contributor

OctoberChang commented Jul 30, 2021

@simonhughes22, in your example Recall@1 should be 0.25 while Prec@1 is 1.0. See the definitions of Prec@k and Recall@k at http://manikvarma.org/downloads/XC/XMLRepository.html#metrics
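A quick numeric check of that example (hypothetical label IDs), where the query has 4 true labels and the single top prediction is relevant:

true_labels = {3, 17, 42, 99}          # the query's 4 ground-truth labels
top1 = [42]                            # top-1 prediction, which is relevant

hits = sum(1 for label in top1 if label in true_labels)
prec_at_1 = hits / 1                   # 1.0
recall_at_1 = hits / len(true_labels)  # 0.25
print(prec_at_1, recall_at_1)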

@OctoberChang
Contributor

Closing this issue. Feel free to reopen if you still have any questions related to the text2text evaluation module.
