KerasTokenizer.fitOnTexts doesn't preserve the words sorting #7448

Closed
MarcoGhise opened this issue Apr 5, 2019 · 2 comments

@MarcoGhise

commented Apr 5, 2019

Issue Description

Using the Python library (from keras.preprocessing import text), I execute the commands
train_posts = ['we play to grow up togheter', 'today I have had an happy to do a lot of nice things such machine learning']
followed by
tokenize.fit_on_texts(train_posts)

I get this result for the word index:
{'togheter': 6, 'play': 3, 'of': 16, 'happy': 12, 'a': 14, 'to': 1, 'machine': 20, 'do': 13, 'grow': 4, 'up': 5, 'we': 2, 'today': 7, 'nice': 17, 'an': 11, 'i': 8, 'lot': 15, 'have': 9, 'such': 19, 'had': 10, 'learning': 21, 'things': 18}

Unexpectedly, using the Java version with KerasTokenizer:

KerasTokenizer tokenize = new KerasTokenizer(1000);
String[] itemsArray = new String[] { "we play to grow up togheter", "today I have had an happy to do a lot of nice things such machine learning" };
tokenize.fitOnTexts(itemsArray);

I get
{play=2, a=3, grow=4, happy=5, i=6, had=7, learning=8, do=9, an=10, we=11, nice=12, togheter=13, lot=14, such=15, machine=16, today=17, of=18, have=19, things=20, to=1, up=21}

That's a problem when I transform my sentences with textsToMatrix, because each word ends up in a different column than it does in the Python version.
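To make the impact concrete, here is a minimal sketch of how the two word indexes above send the same sentence to different columns. The helper toBinaryRow is hypothetical and only a simplified stand-in for a binary-mode matrix transform; it is not the KerasTokenizer implementation. The index values are copied from the two outputs above, limited to three words.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class IndexMismatchDemo {

    // Hypothetical helper (not part of KerasTokenizer): sets column j to 1 for
    // every word whose index is j; unknown or out-of-range words are ignored.
    static int[] toBinaryRow(String sentence, Map<String, Integer> wordIndex, int numWords) {
        int[] row = new int[numWords];
        for (String word : sentence.toLowerCase().trim().split("\\s+")) {
            Integer idx = wordIndex.get(word);
            if (idx != null && idx < numWords) {
                row[idx] = 1;
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // Subset of the index reported for the Python tokenizer.
        Map<String, Integer> pythonIndex = new LinkedHashMap<>();
        pythonIndex.put("to", 1);
        pythonIndex.put("we", 2);
        pythonIndex.put("play", 3);

        // Subset of the index reported for the Java KerasTokenizer.
        Map<String, Integer> javaIndex = new LinkedHashMap<>();
        javaIndex.put("to", 1);
        javaIndex.put("play", 2);
        javaIndex.put("we", 11);

        int numWords = 12;
        // prints [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
        System.out.println(Arrays.toString(toBinaryRow("we play", pythonIndex, numWords)));
        // prints [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
        System.out.println(Arrays.toString(toBinaryRow("we play", javaIndex, numWords)));
    }
}
```

A model trained on rows built with one index would read the rows built with the other as a different set of words.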

It would be enough to change these lines in KerasTokenizer from

private Map<String, Integer> wordCounts = new HashMap<>();
private HashMap<String, Integer> wordDocs = new HashMap<>();

to

private LinkedHashMap<String, Integer> wordCounts = new LinkedHashMap<>();
private LinkedHashMap<String, Integer> wordDocs = new LinkedHashMap<>();

and, in the fitOnTexts method, from
Set<String> sequenceSet = new HashSet<>(Arrays.asList(sequence));
to
Set<String> sequenceSet = new LinkedHashSet<String>(Arrays.asList(sequence));
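The idea behind this change can be sketched in isolation. The class below is a hypothetical, simplified illustration and not the actual KerasTokenizer code: it counts words in a LinkedHashMap so that first-seen order is kept, then applies a stable sort by descending count. It assumes the Python tokenizer ranks words by count and breaks ties by first appearance, which is consistent with the Python output reported above, and it assumes plain lowercase whitespace splitting with no punctuation filtering.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OrderedWordIndex {

    // Hypothetical helper: builds a 1-based word index ordered by descending
    // count, with ties resolved by the order in which words were first seen.
    static Map<String, Integer> buildWordIndex(String[] texts) {
        // LinkedHashMap iterates in insertion order, unlike HashMap.
        Map<String, Integer> wordCounts = new LinkedHashMap<>();
        for (String text : texts) {
            for (String word : text.toLowerCase().trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    wordCounts.merge(word, 1, Integer::sum);
                }
            }
        }

        // ArrayList.sort is stable, so equal counts keep their first-seen order.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(wordCounts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        Map<String, Integer> wordIndex = new LinkedHashMap<>();
        int index = 1;
        for (Map.Entry<String, Integer> entry : entries) {
            wordIndex.put(entry.getKey(), index++);
        }
        return wordIndex;
    }

    public static void main(String[] args) {
        String[] itemsArray = new String[] {
            "we play to grow up togheter",
            "today I have had an happy to do a lot of nice things such machine learning"
        };
        System.out.println(buildWordIndex(itemsArray));
    }
}
```

With the two sentences from this issue, "to" is the only word that appears twice, so it gets index 1, and every other word follows in order of first appearance (we=2, play=3, and so on up to learning=21), matching the Python result.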

I hope this is helpful.

Version Information

Please indicate relevant versions, including, if relevant:

  • Deeplearning4j version: 1.0.0-SNAPSHOT
  • platform information (OS, etc): Windows 10
@farizrahman4u

Member

commented Apr 9, 2019

Fixed #7500

@lock


commented May 9, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators May 9, 2019
