English clue word statistics #1

Open
dopc opened this issue Jul 23, 2022 · 0 comments

Hey!

Thanks for the great work and for sharing it with such thorough documentation.


I tried to reproduce the statistics in Table 13 using the word_tokenize function from nltk and sklearn's CountVectorizer, but I could not.

The problems I observed are:

  • the number of occurrences I get for the clue words does not match yours,
  • the clue word doesn't is split into does and n't by word_tokenize, so it never appears as a single token in the vocabulary. I suspect that you used a different tokenization method when generating these statistics.
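
As an illustration of the second point, here is a minimal sketch of a tokenizer that keeps contractions whole, so that doesn't stays a single token. The regular expression and the name keep_contractions_tokenize are assumptions made for this example; this is not necessarily the tokenization used for Table 13.

```python
import re

# Hypothetical pattern: a word, optionally followed by apostrophe parts
# ("doesn't", "wouldn't"), or any single punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['’][A-Za-z]+)*|[^\sA-Za-z0-9]")


def keep_contractions_tokenize(text):
    """Lowercase the text and split it without breaking contractions."""
    return TOKEN_RE.findall(text.lower())


tokens = keep_contractions_tokenize("It doesn't fit, but I wish it did.")
print(tokens)
```

A callable like this can be passed to CountVectorizer through its tokenizer argument in place of word_tokenize, which makes it easy to check whether the mismatch comes from the tokenizer alone.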

Can you please help me to get the same results?


Below is a minimal script that reproduces my results.

import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder value: set this to the local path of the cloned dataset repository
AMAZON_DATA_PATH = "."


ext_train_df = pd.read_csv(f"{AMAZON_DATA_PATH}/data/EN-ext_train.tsv", sep='\t')
ext_eval_df = pd.read_csv(f"{AMAZON_DATA_PATH}/data/EN-ext_valid.tsv", sep='\t')
ext_test_df = pd.read_csv(f"{AMAZON_DATA_PATH}/data/EN-ext_test.tsv", sep='\t')
df = pd.concat([ext_train_df, ext_eval_df, ext_test_df])

with open(f"{AMAZON_DATA_PATH}/clue_words/counterfactual_clue_words_en.txt") as f:
    clue_words = f.readlines()
clue_words = [clue_word.strip() for clue_word in clue_words]


corpus = df.sentence.values
vectorizer = CountVectorizer(tokenizer=word_tokenize)
X = vectorizer.fit_transform(corpus)


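for i, clue_word in enumerate(clue_words):
    # Clue words that never occur as a single token are missing from the vocabulary
    col = vectorizer.vocabulary_.get(clue_word)
    if col is None:
        freq = 0
    else:
        # Number of sentences in which the clue word occurs at least once
        arr = X[:, col].toarray()
        freq = len(np.nonzero(arr.flatten())[0])
    print(f"{i}. {clue_word} => {freq}")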
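
A vocabulary lookup like the one above can only match clue words that survive tokenization as a single token. As a cross-check, the sketch below counts sentence-level frequencies by matching each clue word as a contiguous token sequence, which also covers multi-word entries if the clue word file contains any (an assumption; the file's contents are not shown here). The tokenizer, the helper names, and the toy sentences are all made up for illustration.

```python
import re


def tokenize(text):
    # Assumed tokenizer: lowercases and keeps contractions such as "doesn't" whole
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)*|[^\sa-z0-9]", text.lower())


def contains_phrase(tokens, phrase_tokens):
    """Return True if phrase_tokens occurs as a contiguous run in tokens."""
    n = len(phrase_tokens)
    return n > 0 and any(
        tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1)
    )


def document_frequency(sentences, clue_words):
    """Count, for each clue word or phrase, the sentences that contain it."""
    tokenized = [tokenize(s) for s in sentences]
    return {
        clue: sum(contains_phrase(toks, tokenize(clue)) for toks in tokenized)
        for clue in clue_words
    }


# Toy sentences and clue words, invented for this example
sentences = [
    "I wish it had been bigger.",
    "It doesn't fit, I should have ordered a larger size.",
    "Great product, fits well.",
]
result = document_frequency(sentences, ["wish", "doesn't", "should have"])
print(result)
```

Comparing these counts with the CountVectorizer output on the same sentences shows which clue words are lost at the tokenization step.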