Minor change in the code #7

Closed
shrinidhin opened this issue Mar 22, 2022 · 2 comments
Comments

@shrinidhin

Hi! I noticed something in preprocess_clausecat.py, at line 61, in the for loop that splits the dataset into train and dev sets:

for label in label_dict:
    split = int(len(label_dict[label]) * eval_split)
    train += label_dict[label][split:]
    dev += label_dict[label][:split]
    checksum += len(label_dict[label])
    table_data.append(
        (
            label,
            len(label_dict[label]),
            len(label_dict[label][split:]),
            len(label_dict[label][:split]),
        )
    )

It looks like the train and dev assignment statements need to be interchanged. With the existing assignment, the train set has fewer samples than the dev set. Shouldn't it be the other way round?
Something like this?

train += label_dict[label][:split]
dev += label_dict[label][split:]

@thomashacker
Contributor

Ah yes, I see why it can be a little confusing, but I think the code is right.
Let's have a look at a small example:

split = int(len(label_dict[label]) * eval_split)

Let's say len(label_dict[label]) = 100
and eval_split = 0.2 (20%)

Then we'd get split = 20

With dev += label_dict[label][:20] we would get the first 20 elements (indices 0 to 19).
With train += label_dict[label][20:] we would get everything after the first 20 elements (indices 20 to len(label_dict[label]) - 1, i.e. 20 to 99).

So this way we'd end up with a train (80%) and dev (20%) split.
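
The example above can be checked with a small, self-contained sketch. The helper name `split_by_label` and the toy `label_dict` are assumptions made for illustration; only the two slicing statements are taken from the loop quoted in this issue.

```python
def split_by_label(label_dict, eval_split):
    """Hypothetical helper that mirrors the slicing in the quoted loop."""
    train, dev = [], []
    for label in label_dict:
        # Number of examples of this label that go to the dev set
        split = int(len(label_dict[label]) * eval_split)
        train += label_dict[label][split:]  # everything after the first `split` items
        dev += label_dict[label][:split]    # the first `split` items
    return train, dev


# Toy data (an assumption): one label with 100 dummy examples
label_dict = {"example_label": list(range(100))}
train, dev = split_by_label(label_dict, eval_split=0.2)
print(len(train), len(dev))  # 80 20
```

With the two slices swapped, as proposed in the original comment, the same input would put 20 examples in train and 80 in dev.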

@shrinidhin
Author

Okay. So eval_split is the fraction of the data that goes to the dev set, right? Meaning that out of 100% of the data, if I want a 70:30 train/dev split, I need to give a value of 0.3 for eval_split.
Thank you!
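
As a rough check of that reading, the sketch below computes per-label train and dev counts for eval_split = 0.3. The function name `split_counts` and the class sizes are assumptions for illustration; the only part taken from the thread is the `int(len(...) * eval_split)` computation. Because `int()` truncates, small classes can end up with slightly less than 30% in dev, or nothing at all.

```python
def split_counts(n_examples, eval_split):
    """Hypothetical helper: return (train_count, dev_count) for one label."""
    split = int(n_examples * eval_split)  # truncates toward zero
    return n_examples - split, split


# Made-up class sizes, chosen to show the effect of truncation
for n in (100, 7, 3):
    print(n, split_counts(n, 0.3))
# 100 (70, 30)
# 7 (5, 2)
# 3 (3, 0)
```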
