Apparent bias toward concepts that start with "a" and "b" in the relationship list API results #200

Closed
jimkang opened this issue May 6, 2018 · 3 comments

jimkang commented May 6, 2018

If you call http://api.conceptnet.io/r/PartOf and follow all of the nextPage responses, you end up getting about 1320 relationships. If you count each unique concept referenced in the start and end properties of these relationships, then look at which letters these concepts start with (throwing away 'a ' and 'an ' prefixes), you get this:

{
  "1": 13,
  "2": 11,
  "4": 1,
  "7": 1,
  "9": 1,
  "b": 352,
  "t": 67,
  "w": 24,
  "a": 720,
  "m": 63,
  "n": 42,
  "s": 69,
  "g": 43,
  "d": 24,
  "i": 25,
  "h": 39,
  "k": 25,
  "j": 17,
  "o": 17,
  "r": 24,
  "p": 59,
  "e": 46,
  "u": 17,
  "v": 15,
  "c": 75,
  "f": 39,
  "l": 44,
  "y": 4,
  "q": 3,
  "z": 7,
  "о": 1,
  "а": 1,
  "č": 1,
  "ö": 1,
  "δ": 6,
  "χ": 2,
  "π": 1,
  "x": 2,
  ".": 1,
  "å": 1
}

That's a lot of a- and b- concepts! If you query the relationship endpoint with AtLocation, you end up with something similarly skewed:

{
  "0": 1,
  "1": 3,
  "2": 1,
  "3": 4,
  "4": 1,
  "5": 1,
  "6": 1,
  "7": 2,
  "a": 510,
  "t": 283,
  "z": 4,
  "y": 30,
  "h": 54,
  "b": 453,
  "m": 101,
  "c": 117,
  "r": 48,
  "s": 151,
  "w": 42,
  "f": 66,
  "p": 79,
  "g": 29,
  "k": 12,
  "o": 33,
  "d": 36,
  "j": 10,
  "i": 30,
  "l": 43,
  "#": 1,
  "n": 50,
  "v": 14,
  "u": 13,
  "e": 21,
  " ": 1,
  "q": 4,
  "ø": 1
}

Is this because there is an issue with paging stopping early? Or is there an issue with the data?
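
For anyone who wants to reproduce the tally, here is a minimal Python sketch of the counting described above. It assumes each API page carries its edges under an `edges` key, the next-page link under `view.nextPage`, and a human-readable `label` on each `start` and `end` node; the function names are made up for illustration and this is not the script that produced the numbers in this issue.

```python
import json
from collections import Counter
from urllib.parse import urljoin
from urllib.request import urlopen

API_ROOT = "http://api.conceptnet.io"


def fetch_all_edges(relation_path, max_pages=1000):
    """Yield every edge for a relation by following nextPage links.

    Assumes the response layout {"edges": [...], "view": {"nextPage": ...}}.
    """
    url = urljoin(API_ROOT, relation_path)
    for _ in range(max_pages):
        with urlopen(url) as response:
            page = json.load(response)
        yield from page.get("edges", [])
        next_page = page.get("view", {}).get("nextPage")
        if not next_page:
            break
        url = urljoin(API_ROOT, next_page)


def strip_article(label):
    """Lowercase a label and drop a leading 'a ' or 'an '."""
    lowered = label.lower()
    for prefix in ("a ", "an "):
        if lowered.startswith(prefix):
            return lowered[len(prefix):]
    return lowered


def first_letter_counts(edges):
    """Count unique start/end concepts by their first character."""
    concepts = set()
    for edge in edges:
        for side in ("start", "end"):
            label = edge.get(side, {}).get("label")
            if label:
                concepts.add(strip_article(label))
    return Counter(concept[0] for concept in concepts if concept)
```

Calling `first_letter_counts(fetch_all_edges("/r/PartOf"))` would then give a per-letter tally comparable to the ones above, subject to the same early cutoff.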

Member

rspeer commented May 7, 2018

Good observation! That looks a lot like paging stopping early: in particular, the database query stopping early to avoid slowdowns when there are too many results.

We're working on something that uses machine learning to refine the "weight" value. That would rank the results by something more meaningful than the order in which they were put in the database, although it would be biased toward certain data sources.

If you need to get all the edges for a common relation, I'd recommend downloading the data and filtering it.
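
As a rough sketch of the download-and-filter route: the code below assumes the downloadable assertions file is a gzipped, tab-separated dump with five columns per line (edge URI, relation, start node, end node, JSON metadata). The column layout, the function names, and any file name you pass in are assumptions for illustration, so check them against the dump you actually download.

```python
import gzip
import json


def read_dump(path):
    """Yield text lines from a gzipped assertions dump."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle


def iter_edges(lines):
    """Parse tab-separated lines into dicts, skipping malformed rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue
        uri, relation, start, end, info = fields
        yield {
            "uri": uri,
            "rel": relation,
            "start": start,
            "end": end,
            "info": json.loads(info),
        }


def edges_for_relation(lines, relation):
    """Keep only the edges whose relation column matches, e.g. '/r/PartOf'."""
    return (edge for edge in iter_edges(lines) if edge["rel"] == relation)
```

Because everything is a generator, `edges_for_relation(read_dump(path), "/r/PartOf")` streams through the file without loading it all into memory.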

Author

jimkang commented May 7, 2018

Aha! Thank you, it didn't occur to me that downloading it all was an option!

What I'm trying to accomplish is getting a random edge for a relation. Right now, I think I'll download, convert to ndjson so there's one edge per line, then pick a random line from the file. I don't suppose there's a more direct way to do that?

I see that there's a method for that in the DB wrapper, but it does not appear to be exposed?
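
One way to sketch the pick-a-random-line step without counting lines first is single-item reservoir sampling, which reads the file once and keeps each line with probability 1/n. This is a swapped-in standard technique, not the method in the DB wrapper, and `pick_random_item` is a hypothetical name.

```python
import random


def pick_random_item(items, rng=random):
    """Return one uniformly random item from an iterable in a single pass.

    Returns None when the iterable is empty.
    """
    chosen = None
    for count, item in enumerate(items, start=1):
        # Replace the current pick with probability 1/count.
        if rng.randrange(count) == 0:
            chosen = item
    return chosen
```

Passing an open file handle for the one-edge-per-line ndjson file as `items` would return a single random line, which can then be parsed as JSON.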

Member

rspeer commented Apr 16, 2019

One reason we didn't expose the query for random edges is that the ConceptNet server relies on a lot of caching to stay up, and randomness can't be cached.

The bias toward alphabetically earlier concepts is fixed in 5.7.

rspeer closed this as completed Apr 16, 2019