Trouble running large dataset #54

Closed
YValarieAnne opened this issue Mar 3, 2023 · 4 comments

@YValarieAnne

Hello,

Thank you for publishing this code in Python on GitHub and for the support!
I am completing my dissertation using scanpaths; the binary (pairwise) comparisons work, although the processing is long.
Rob Newport at https://github.com/robnewport/SoftMatch has been wonderful in supporting this large dataset, but the process is long using his system in MATLAB as well. I need this output to finish my analyses and defend in 30 days.

I would appreciate any suggestions you may have on running a very large dataset. Each binary comparison takes approximately 40 minutes, and I have 50 participants who need cross-comparisons between two conditions for 5 scenario runs.

Do you have any suggestions on how to process these comparisons in a shorter time?
Each participant file has over 30,000 rows of fixation x-y coordinates.

Thanks in advance,

Valarie

@adswa (Owner) commented Mar 3, 2023

Hi Valarie,

Could you parallelize the computations, e.g., by submitting compute jobs via a job scheduler on a compute cluster?
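For illustration, a minimal sketch of distributing the pairwise comparisons across local CPU cores with Python's multiprocessing could look like the following; `compare_scanpaths`, the file pattern, and the CSV layout are placeholders/assumptions, not part of this toolbox:

```python
# Sketch: run all pairwise scanpath comparisons in parallel on local CPU cores.
# compare_scanpaths() is a placeholder for the per-pair comparison you already
# run; the glob pattern and CSV layout are assumptions about your data.
from itertools import combinations
from multiprocessing import Pool
from pathlib import Path

import pandas as pd


def compare_scanpaths(path_a, path_b):
    """Placeholder: load two fixation files and return some similarity result."""
    scanpath_a = pd.read_csv(path_a)
    scanpath_b = pd.read_csv(path_b)
    # ... call the actual comparison on scanpath_a / scanpath_b here ...
    return (path_a.name, path_b.name, len(scanpath_a), len(scanpath_b))


def run_pair(pair):
    return compare_scanpaths(*pair)


if __name__ == "__main__":
    files = sorted(Path("data").glob("sub-*_run-*.csv"))  # hypothetical layout
    pairs = list(combinations(files, 2))
    with Pool() as pool:  # defaults to one worker per available CPU core
        results = pool.map(run_pair, pairs)
    pd.DataFrame(results, columns=["a", "b", "n_fix_a", "n_fix_b"]).to_csv(
        "comparisons.csv", index=False)
```

The same pairwise loop can also be split into batches and submitted as separate jobs to a cluster scheduler such as HTCondor or SLURM, which would scale beyond a single machine.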

@adswa (Owner) commented Mar 3, 2023

Depending on your experimental paradigm, it may also make sense to split the 30k rows into shorter time chunks. I haven't worked a lot on the topic of gaze paths and gaze path comparisons, and only you know what would work for your experiment, but if the scan paths you compare are 30k lines long, I would suspect that similarities between them get distorted/diminished as a side effect of the long vector length.
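If that is compatible with your paradigm, a minimal sketch of splitting one long fixation recording into fixed-duration time windows could look like this (the `onset` column name and the 30-second window are assumptions about your data, not a toolbox feature):

```python
# Sketch: split one long fixation recording into fixed-duration chunks so that
# each comparison operates on a much shorter scanpath. Column name is assumed.
import pandas as pd


def chunk_scanpath(path, window_s=30.0, onset_col="onset"):
    """Yield (chunk_index, DataFrame) pairs covering successive time windows."""
    fixations = pd.read_csv(path)
    t0 = fixations[onset_col].min()
    chunk_ids = ((fixations[onset_col] - t0) // window_s).astype(int)
    for idx, chunk in fixations.groupby(chunk_ids):
        yield idx, chunk.reset_index(drop=True)
```

You could then compare chunk i of one recording against chunk i of the other, rather than the full 30k-row scanpaths at once, which also shortens each individual comparison considerably.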

@YValarieAnne (Author)

Hello, thank you for your kind response! I am working on using the MATLAB parallel processes now. No, chunking the scanpath rows into sections would remove some of the difference/similarity analyses, as I cannot tell to the millisecond where each participant was in the scenario script; I can only view it as a high-level picture, which is what I am looking to achieve here.
I was hoping someone had experience with a dataset this large. I will update this with the solution that works, in case others are met with this conundrum in the future. Many thanks, Valarie

@adswa (Owner) commented Mar 3, 2023

I'm glad you seem to have found a solution for your problem :) I'll close this issue as there is nothing I can do in this toolbox at the moment, but do feel free to reopen this issue at a later point, or open a new one if something comes up. :)

adswa closed this as completed Mar 3, 2023