Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correction of mistakes associated with co-occurrence gathering and collection parsing #947

Merged
merged 16 commits into from
Feb 8, 2019

Conversation

MichaelSolotky
Copy link
Contributor

@MichaelSolotky MichaelSolotky commented Feb 2, 2019

Major changes:

  1. The command line key "--threads" didn't affect parsing of a collection, because it hasn't been sending in collection parsing config (fixed).
  2. Added mutexes places of access to shared resources like paths to co-occurrence batches in the target dir and token map.
  3. Set more correct maximal number of files that a process is allowed to hold open simultaneously (my experiments with MacOS showed that this number is about 251 files, it's a pity, but I couldn't find any purpose of that. A process was just refusing to open any more files after exceeding this limit and no exceptions had been throwing, the program was finishing correctly, but with the wrong output). Also, I didn't use the value 251, but 241 in order to prevent crashes on other platforms.
  4. Explicitly limit the number of parallel threads in merging of co-occurrence batches (if it would be greater than maximal number of open files / 3, that would be ineffective, because in merging frequent opens and closes of files will be needed, but it's an expensive operation
  5. Initialize uninitialized variables in BufferOfCooccurrences
  6. Make vocab parsing more powerful (now it can parse files with commas that divide tokens and modalities)
  7. Fixed determination of current modality in text
  8. Fixed one more trouble with AppVeyor-mingw that suddenly appeared: due to caching of some packages one of them couldn't be upgraded. The installation process was trying to create a dir with a special name "/var/lib/pacman/local/msys2-runtime-2.11.2-1/" which existed due to caching. More details you can find here: https://ci.appveyor.com/project/bigartm/bigartm-542mi/builds/22092938/job/scnqe4nle8fn3eto
  9. Bump version to 0.10.0 due to the bug-fix and no interface change

Minor changes:

  1. Renamed some variables in order to better represent their sense (like config_ -> collection_parser_config_ since there is a co-occurrence collector config, ResultingBufferOfCooccurrences -> BufferOfCooccurrences since this buffer can be used as intermediate storage)
  2. Corrected some grammatical mistakes in comments
  3. Removed the function CooccurrenceCollector::ReadVowpalWabbit since it's not used, instead, the function from the collection_parser is used to read and parse VowpalWabbit files
  4. Rewrote more clearly the function CooccurrenceCollector::FirstStageOfMerging with lots of comments
  5. Removed rudiments of opening-closing files during the KWayMerge procedure

Copy link
Contributor

@MelLain MelLain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

@MichaelSolotky MichaelSolotky merged commit 8002c1f into bigartm:master Feb 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants