"conda env create -f environment.yml" fails every time I try it: it stalls at "Solving environment" (shown in the picture) and eventually fails. I remember hitting the same problem last year when I first tried BLINK. I installed all the dependencies before running environment.yml. #7
Comments
Hi @YUANMENG-1. I'm able to get the environment to solve on my machine, but it does indeed take a long time. I'll look into improvements I can make to the environment.yml to fix this issue. In the meantime, maybe try using Mamba? It is a drop-in replacement for Conda and is typically much faster at building environments. I gave it a try with BLINK's environment and it finished in just a few minutes:
Let me know how it goes and if you need any additional help getting BLINK going.
Sorry, I ran into another problem when running the demo data:
python3 -m blink.blink_cli ./example/accuracy_test_data/small.mgf ./example/accuracy_test_data/medium.mgf ./blink2out.csv ./models/positive_random_forest.pickle ./models/negative_random_forest.pickle positive --min_predict 0.01 --mass_diffs 0 14.0157 12.000 15.9949 2.01565 27.9949 26.0157 18.0106 30.0106 42.0106 1.9792 17.00284 24.000 13.97925 1.00794 40.0313
The output includes a warning along the lines of "The warning highlights a potential risk. Using a model trained with scikit-learn version 1.0.2 and loading it with version 1.4.1.post1 might lead to unexpected behavior or invalid results." Running "conda install scikit-learn=1.0.2" resolves this warning:
INFO:root:Processing small.mgf
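For anyone hitting the same warning, here is a minimal sketch of a check that could be run before loading the pickled models. It assumes the models were trained with scikit-learn 1.0.2, as the warning states; `installed_version` and `check_version` are hypothetical helper names, not part of BLINK.

```python
from importlib import metadata

# Version the pickled models were trained with, according to the warning above.
EXPECTED_SKLEARN = "1.0.2"


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_version(installed, expected):
    """Hypothetical helper: report 'ok', 'mismatch', or 'missing'."""
    if installed is None:
        return "missing"
    return "ok" if installed == expected else "mismatch"


status = check_version(installed_version("scikit-learn"), EXPECTED_SKLEARN)
print(f"scikit-learn status: {status}")
```

If the status is "mismatch", pinning the version as described above ("conda install scikit-learn=1.0.2") should bring the environment in line with the models.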
Sorry again, this time a new problem occurred, which may be that it cannot find the charge column:
python3 -m blink.blink_cli ./example/accuracy_test_data/small.mgf ./example/accuracy_test_data/medium.mgf ./blink2out.csv ./models/positive_random_forest.pickle ./models/negative_random_forest.pickle positive --min_predict 0.01 --mass_diffs 0 14.0157 12.000 15.9949 2.01565 27.9949 26.0157 18.0106 30.0106 42.0106 1.9792 17.00284 24.000 13.97925 1.00794 40.0313
INFO:root:Processing small.mgf
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
Hi @YUANMENG-1. Sorry you are having issues with the command-line implementation. The CLI is still under active development and has some experimental features that are currently outside the scope of what was published in the BLINK paper. For standard use, we recommend following the examples in the tutorial notebook.
Sorry to bother you again. I tried to compare two mgf files by adapting the mzML-versus-mgf tutorial from your link. The problems in the output file are as follows:
import sys
mgf_query = blink.open_msms_file('../SpectralEntropy-master/neg_slaw_modi7.mgf')
discretized_spectra = blink.discretize_spectra(mgf_ref.spectrum.tolist(), mgf_query.spectrum.tolist(), mgf_ref.precursor_mz.tolist(), mgf_query.precursor_mz.tolist(), bin_width=0.001, tolerance=0.01, intensity_power=0.5, trim_empty=False, remove_duplicates=False, network_score=False)
%%time
filtered_S12 = blink.filter_hits(S12, min_matches=5, override_matches=20, min_score=0.6)
df.to_csv('output_test.csv', index=False)
No problem, I'm happy to help. Your code looks good; the issue seems to be in merging the metadata. I think what you need to do is switch the order of your mgf_query and mgf_ref in blink.discretize_spectra() and try it again. The "query" column in the output dataframe always corresponds to indices of the first set of spectra, while the "ref" column holds the indices of the second set of spectra. This is easy to mix up, and I do need to improve the documentation on how this works. I recently fixed the same issue in the tutorial notebook itself, so check out the most recent version if you have an older one. As for your other question about the spectra, those look okay to me. It is less obvious when the arrays are converted to strings in a saved csv, but each spectrum is modeled as an array of two arrays. The first array holds the m/z values, and the second array holds their intensities. For instance, the first "spectrum_ref" entry is the following:
The first list there holds the m/z values, and the second holds the intensities. Hopefully this helps!
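To make the "array of two arrays" layout concrete, here is a small sketch using NumPy. The peak values are hypothetical and chosen only for illustration; they are not taken from the files in this issue.

```python
import numpy as np

# Hypothetical spectrum for illustration only; not data from this issue.
spectrum = np.array([
    [100.05, 150.10, 200.20],  # first row: m/z values
    [10.0, 250.0, 40.0],       # second row: intensities, one per m/z value
])

# Unpack the two rows into separate arrays.
mz, intensity = spectrum

# The m/z of the most intense peak (the base peak).
base_peak_mz = mz[np.argmax(intensity)]

print(spectrum.shape)  # (2, number_of_peaks)
print(base_peak_mz)
```

When a dataframe holding such arrays is written to csv, each entry is stored as its string representation, which is why it looks unfamiliar in the output file.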
My guess is that your output is so much larger because the metadata is now being associated correctly, though there could be something else going on. This appears to be a pretty big comparison, so the pd.merge adds a lot of extra content to the output dataframe. The order of the spectra in the discretize_spectra function shouldn't change the size of the score matrix. The algorithm is more efficient when the smaller set of spectra is first, but it shouldn't make a huge difference (query is typically smaller than ref). This is more of a pandas question than a BLINK question; however, I can give you some suggestions. If you want to decrease the size of your outputs, you can filter the output dataframe by score or number of matches before adding metadata. If you have already filtered those, then you can choose to associate only the essential metadata with the merge, instead of everything read from the mgf files. For instance:
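A minimal pandas sketch of both suggestions follows. The "query" and "ref" index columns are as described earlier in this thread; the "score" and "matches" column names, the metadata columns, and all values are assumptions made up for illustration, so adapt them to the actual BLINK output.

```python
import pandas as pd

# Hypothetical score table: "query" and "ref" are row indices into the two spectrum sets.
scores = pd.DataFrame({
    "query": [0, 0, 1, 2],
    "ref": [5, 7, 5, 9],
    "score": [0.95, 0.40, 0.72, 0.65],
    "matches": [12, 3, 8, 4],
})

# Hypothetical reference metadata carrying more columns than the output needs.
ref_meta = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "precursor_mz": [301.1, 415.2, 289.0],
        "comment": ["x", "y", "z"],
    },
    index=[5, 7, 9],
)

# 1. Filter by score and number of matches before adding any metadata.
filtered = scores[(scores["score"] >= 0.6) & (scores["matches"] >= 5)]

# 2. Merge only the essential metadata columns, matched on the "ref" index.
essential = ref_meta[["name", "precursor_mz"]].add_suffix("_ref")
out = filtered.merge(essential, left_on="ref", right_index=True, how="left")

print(out)
```

Filtering first keeps the merge small, and selecting columns before merging avoids copying unused metadata into every hit row.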
If I/O and file size are a concern, maybe look into using parquet files or similar, rather than csv. Good luck!
I ran into this both on the HPC and on my Mac.
Originally posted by @YUANMENG-1 in #4 (comment)