Why are there ~1500 duplicate words here? #6

Closed
farzher opened this issue Jun 8, 2015 · 5 comments

Comments

@farzher

farzher commented Jun 8, 2015

Shouldn't the list be deduplicated?

@kylemcdonald

Yes, it looks like 20k.txt has 1470 duplicate lines, and the usa file has 10:

$ wc -l < 20k.txt 
   19999
$ sort 20k.txt | uniq | wc -l
   18529
$ wc -l < google-10000-english-usa.txt 
    9999
$ sort google-10000-english-usa.txt | uniq | wc -l
    9989
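
The line counts above show how many entries are redundant, but not which words repeat. A minimal Python sketch of the same check that also lists the repeated words; `duplicate_report` is a hypothetical helper name and the sample list is made up for illustration:

```python
from collections import Counter


def duplicate_report(words):
    """Return (total, unique, repeated) for a list of words.

    `repeated` maps each word that occurs more than once to its count.
    Comparison is exact and case-sensitive.
    """
    counts = Counter(words)
    repeated = {w: n for w, n in counts.items() if n > 1}
    return len(words), len(counts), repeated


# Made-up sample; for the real file, something like
# open("20k.txt", encoding="utf-8").read().split() should work.
sample = ["the", "of", "word", "the", "word", "word"]
total, unique, repeated = duplicate_report(sample)
print(total - unique, "redundant lines")  # total minus unique, as in the shell check
print(repeated)
```

Sorting `repeated` by count would put the most-repeated words first.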

@whitten

whitten commented Jun 26, 2015

I don't know. Is it a case-sensitivity issue?
Do sort and uniq keep only one of "this" and "This"?
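
One way to test the case question directly is to count distinct words twice, once exactly and once with case folded. A small Python sketch, assuming the words are already loaded into a list; `count_unique` and the sample data are made up for illustration:

```python
def count_unique(words, ignore_case=False):
    """Count distinct words, optionally folding case first."""
    if ignore_case:
        words = [w.casefold() for w in words]
    return len(set(words))


sample = ["this", "This", "that", "that"]
exact = count_unique(sample)                     # "this" and "This" stay distinct
folded = count_unique(sample, ignore_case=True)  # "this" and "This" merge
print(exact, folded)
```

If both counts come out the same on the real file, the duplicates are exact and case plays no part.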

@farzher
Author

farzher commented Jun 26, 2015

It's not a case issue. It's exact duplicates. Check using any random dedupe tool.

[image]

Apparently Word is in there 9 times

@koseki
Contributor

koseki commented Jul 18, 2016

It seems that 20k.txt combines two different sources.

I checked the frequency rankings of the words in 20k.txt, and this is the result:

[image: freq-g20k]

The original count_1w.txt gives a straight graph.

[image: freq]

[image: shot 186]
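
The check described above can be sketched roughly as follows: look up each word's corpus count and report where the counts stop decreasing down the list. This assumes count_1w.txt holds one `word<TAB>count` pair per line; `load_counts` and `rank_breaks` are hypothetical names, and the inline data is made up:

```python
def load_counts(lines):
    """Parse `word<TAB>count` lines into a dict (assumed file format)."""
    counts = {}
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        if word and count:
            counts[word] = int(count)
    return counts


def rank_breaks(words, counts):
    """Return positions where a word is more frequent than the one before it.

    A list sorted by descending frequency gives an empty result; words
    missing from `counts` are treated as frequency 0.
    """
    freqs = [counts.get(w, 0) for w in words]
    return [i for i in range(1, len(freqs)) if freqs[i] > freqs[i - 1]]


counts = load_counts(["the\t100", "of\t60", "and\t50", "zebra\t1"])
in_order = rank_breaks(["the", "of", "and", "zebra"], counts)
out_of_order = rank_breaks(["the", "zebra", "of", "and"], counts)
print(in_order, out_of_order)
```

A list stitched together from two sources would be expected to show a break around the point where the second source begins.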

koseki added a commit to koseki/google-10000-english that referenced this issue Jul 18, 2016
worldlywisdom added a commit that referenced this issue Jul 18, 2016
Replace the last half of 20k.txt using count_1w.txt #6
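
The commit's contents are not shown in this thread. As a rough sketch of one way to build a list of a given length without repeats from frequency-ordered words (not necessarily what the commit does), with `top_unique` as a hypothetical helper and made-up sample data:

```python
def top_unique(words, n):
    """Return the first `n` distinct words, keeping first-occurrence order."""
    # dict keys keep insertion order, so only later repeats are dropped
    return list(dict.fromkeys(words))[:n]


ranked = ["the", "of", "the", "and", "of", "to"]  # made-up, already frequency-ordered
top3 = top_unique(ranked, 3)
print(top3)
```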
@worldlywisdom
Collaborator

Great catch - not sure why the original source has duplicates. I appreciate the fix.
