Question about Word2vec.c #19

Open
mereogeometry opened this issue Feb 25, 2018 · 1 comment

Comments

@mereogeometry

Hi,
We're using word2vec for hypernymy discovery. To design a more efficient version of word2vec, we need to know exactly what the variable "c" means in the function ReadVocab() in the file word2vec.c. Thanks in advance.
void ReadVocab() {
  long long a, i = 0;
  char c;
  char word[MAX_STRING];
  FILE *fin = fopen(read_vocab_file, "rb");
  if (fin == NULL) {
    printf("Vocabulary file not found\n");
    exit(1);
  }
  for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1;
  vocab_size = 0;
  while (1) {
    ReadWord(word, fin);
    if (feof(fin)) break;
    a = AddWordToVocab(word);
    fscanf(fin, "%lld%c", &vocab[a].cn, &c); // semantics of c?
    i++;
  }
  SortVocab();
  if (debug_mode > 0) {
    printf("Vocab size: %lld\n", vocab_size);
    printf("Words in train file: %lld\n", train_words);
  }
  fin = fopen(train_file, "rb");
  if (fin == NULL) {
    printf("ERROR: training data file not found!\n");
    exit(1);
  }
  fseek(fin, 0, SEEK_END);
  file_size = ftell(fin);
  fclose(fin);
}

@Neutrinoant

Hi mereogeometry, I analyzed word2vec.c for further study.
The file 'read_vocab_file' (opened in binary mode) has the following format:
word1Acount1Bword2Acount2B......
where A and B are single whitespace characters such as '\t' or '\n'.
For example, ReadWord() reads 'word1' and 'A', then fscanf() reads 'count1' and 'B'. 'B' is assigned to the variable c, which acts as a throwaway variable: its value is never used, and it exists only so that fscanf() consumes the delimiter after the count.
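
The read pattern described above can be sketched in isolation. This is a minimal illustration, not code from word2vec.c: read_token() and read_record() are hypothetical names, read_token() is a simplified stand-in for ReadWord(), and the record layout (space for A, newline for B) is an assumption made for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not in word2vec.c): reads one whitespace-delimited
   word byte by byte and consumes the single delimiter that ends it.
   Returns the word length, or -1 if the stream is already at end of file. */
int read_token(FILE *fin, char *word, int cap) {
  int len = 0, ch;
  assert(word != NULL && cap > 1);
  while ((ch = fgetc(fin)) != EOF) {
    if (ch == ' ' || ch == '\t' || ch == '\n') break;  /* delimiter A */
    if (len < cap - 1) word[len++] = (char)ch;         /* truncate long words */
  }
  word[len] = '\0';
  return (ch == EOF && len == 0) ? -1 : len;
}

/* Reads one "word<A>count<B>" record. %lld parses the count; %c does not
   skip whitespace, so it consumes exactly one byte, the delimiter B, and
   stores it in *delim. Returns 1 on success, 0 at end of file or when the
   word is empty or the count is missing. */
int read_record(FILE *fin, char *word, int cap, long long *count, char *delim) {
  *delim = '\0';
  if (read_token(fin, word, cap) <= 0) return 0;
  return fscanf(fin, "%lld%c", count, delim) >= 1;
}
```

Calling read_record() in a loop until it returns 0 walks the whole file. In this sketch, replacing "%lld%c" with "%lld" would leave B in the stream, and the next read_token() call would stop on it immediately and return an empty word, which is why the otherwise unused byte is read.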
