Hey!
Thanks for the great work and for sharing it with such thorough documentation.
I wanted to reproduce the statistics in Table 13 using the `word_tokenize` function from `nltk` and sklearn's `CountVectorizer`, but I could not.
The problems I observed are:
- the number of occurrences of the clue words that I get does not match yours,
- the clue word `doesn't` is not kept as a single token; it is split into `does` and `n't` by `word_tokenize` (see the snippet below). I suspect you used a different tokenization method when generating these statistics.
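For reference, this is the tokenization behavior I mean:

from nltk.tokenize import word_tokenize

# word_tokenize splits contractions, so "doesn't" never appears
# as a single token in the CountVectorizer vocabulary.
print(word_tokenize("doesn't"))  # ['does', "n't"]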
Could you please help me get the same results?
Below is a minimal code example that reproduces my results.
import os

import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# AMAZON_DATA_PATH is assumed to point at the root of the dataset checkout.
AMAZON_DATA_PATH = os.environ.get("AMAZON_DATA_PATH", ".")

# Load all three splits and pool them into a single DataFrame.
ext_train_df = pd.read_csv(f"{AMAZON_DATA_PATH}/data/EN-ext_train.tsv", sep="\t")
ext_eval_df = pd.read_csv(f"{AMAZON_DATA_PATH}/data/EN-ext_valid.tsv", sep="\t")
ext_test_df = pd.read_csv(f"{AMAZON_DATA_PATH}/data/EN-ext_test.tsv", sep="\t")
df = pd.concat([ext_train_df, ext_eval_df, ext_test_df])

# Read the clue word list, one word per line.
with open(f"{AMAZON_DATA_PATH}/clue_words/counterfactual_clue_words_en.txt") as f:
    clue_words = [clue_word.strip() for clue_word in f.readlines()]

# Build a document-term matrix over the pooled sentences, tokenized with nltk.
corpus = df.sentence.values
vectorizer = CountVectorizer(tokenizer=word_tokenize)
X = vectorizer.fit_transform(corpus)

for i, clue_word in enumerate(clue_words):
    col = vectorizer.vocabulary_.get(clue_word)
    if col is None:
        # The clue word never survives tokenization as a single token
        # (e.g. "doesn't" is split into "does" and "n't").
        print(f"{i}. {clue_word} => not in vocabulary")
        continue
    arr = X[:, col].toarray()
    # freq = number of sentences containing the clue word at least once.
    freq = len(np.nonzero(arr.flatten())[0])
    print(f"{i}. {clue_word} => {freq}")