English clue word statistics #1

dopc · 2022-07-23T21:31:30Z

Hey!

Thanks for the great work and for sharing it with such good documentation.

I wanted to reproduce the statistics in Table 13 using the word_tokenize function from nltk and sklearn's CountVectorizer, but I could not.

The problems I observed are:

the numbers of occurrences I get for the clue words do not match yours,
the clue word "doesn't" is tokenized as "does" and "n't" by word_tokenize. I suspect that you used a different tokenization method when generating these statistics.

Can you please help me to get the same results?

Below is minimal code to reproduce my results.

import os
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer


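# Assumption: AMAZON_DATA_PATH points to the local clone of the dataset repository; adjust as needed.
AMAZON_DATA_PATH = "."

ext_train_df = pd.read_csv(f"{AMAZON_DATA_PATH}/data/EN-ext_train.tsv", sep='\t')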
ext_eval_df = pd.read_csv(f"{AMAZON_DATA_PATH}/data/EN-ext_valid.tsv", sep='\t')
ext_test_df = pd.read_csv(f"{AMAZON_DATA_PATH}/data/EN-ext_test.tsv", sep='\t')
df = pd.concat([ext_train_df, ext_eval_df, ext_test_df])

with open(f"{AMAZON_DATA_PATH}/clue_words/counterfactual_clue_words_en.txt") as f:
    clue_words = f.readlines()
clue_words = [clue_word.strip() for clue_word in clue_words]


corpus = df.sentence.values
vectorizer = CountVectorizer(tokenizer=word_tokenize)
X = vectorizer.fit_transform(corpus)


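for i, clue_word in enumerate(clue_words):
    # vocabulary_.get returns None when the tokenizer never produced this token (e.g. "doesn't")
    col = vectorizer.vocabulary_.get(clue_word)
    if col is None:
        freq = 0
    else:
        # Number of sentences in which the clue word occurs at least once
        arr = X[:, col].toarray()
        freq = len(np.nonzero(arr.flatten())[0])
    print(f"{i}. {clue_word} => {freq}")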
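
As a cross-check that does not depend on any tokenizer, the per-sentence counts can also be approximated with plain regular expressions. This is only a sketch: the helper `document_frequency` and the toy sentences are hypothetical, the matching is case-insensitive by assumption, and it is not necessarily how the statistics in Table 13 were produced.

```python
import re


def document_frequency(sentences, clue_words):
    """Count, for each clue word, how many sentences contain it at least once."""
    counts = {}
    for clue in clue_words:
        # (?<!\w) and (?!\w) act as word boundaries that also work for clues
        # containing apostrophes or spaces, e.g. "doesn't" or "if only".
        pattern = re.compile(
            r"(?<!\w)" + re.escape(clue) + r"(?!\w)", re.IGNORECASE
        )
        counts[clue] = sum(1 for s in sentences if pattern.search(s))
    return counts


# Toy sentences for illustration only; they are not taken from the dataset.
sentences = [
    "It doesn't fit, but I wish it did.",
    "If only the battery lasted longer.",
    "Works as described.",
    "I wish I had bought two. Really wish.",
]
counts = document_frequency(sentences, ["doesn't", "wish", "if only", "would"])
for clue, freq in counts.items():
    print(f"{clue} => {freq}")
```

Because the clue is matched against the raw sentence, "doesn't" is found even though word_tokenize would split it, and multi-word clues are handled the same way as single words. Comparing these counts with the CountVectorizer output may help isolate which differences come from tokenization.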
