Apparent bias toward concepts that start with "a" and "b" in the relationship list API results #200

jimkang · 2018-05-06T21:21:22Z

If you call http://api.conceptnet.io/r/PartOf and follow all of the nextPage responses, you end up getting about 1320 relationships. If you count each unique concept referenced in the start and end properties of these relationships, then look at which letters these concepts start with (throwing away 'a ' and 'an ' prefixes), you get this:

{
  "1": 13,
  "2": 11,
  "4": 1,
  "7": 1,
  "9": 1,
  "b": 352,
  "t": 67,
  "w": 24,
  "a": 720,
  "m": 63,
  "n": 42,
  "s": 69,
  "g": 43,
  "d": 24,
  "i": 25,
  "h": 39,
  "k": 25,
  "j": 17,
  "o": 17,
  "r": 24,
  "p": 59,
  "e": 46,
  "u": 17,
  "v": 15,
  "c": 75,
  "f": 39,
  "l": 44,
  "y": 4,
  "q": 3,
  "z": 7,
  "о": 1,
  "а": 1,
  "č": 1,
  "ö": 1,
  "δ": 6,
  "χ": 2,
  "π": 1,
  "x": 2,
  ".": 1,
  "å": 1
}

That's a lot of a- and b- concepts! If you query the relationship endpoint with AtLocation, you end up with something similarly skewed:

{
  "0": 1,
  "1": 3,
  "2": 1,
  "3": 4,
  "4": 1,
  "5": 1,
  "6": 1,
  "7": 2,
  "a": 510,
  "t": 283,
  "z": 4,
  "y": 30,
  "h": 54,
  "b": 453,
  "m": 101,
  "c": 117,
  "r": 48,
  "s": 151,
  "w": 42,
  "f": 66,
  "p": 79,
  "g": 29,
  "k": 12,
  "o": 33,
  "d": 36,
  "j": 10,
  "i": 30,
  "l": 43,
  "#": 1,
  "n": 50,
  "v": 14,
  "u": 13,
  "e": 21,
  " ": 1,
  "q": 4,
  "ø": 1
}

Is this because there is an issue with paging stopping early? Or is there an issue with the data?
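
For reference, the tally described above can be sketched roughly as follows in Python. This is only an illustration: the response shape (`edges`, `view.nextPage`, and a `label` under `start` and `end`) is an assumption about the API's JSON, and the helper names `fetch_page`, `iter_edges`, `first_letter`, and `tally_first_letters` are hypothetical.

```python
import json
from collections import Counter
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_page(path, base="http://api.conceptnet.io"):
    """Fetch one page of API results as a dict (assumed JSON response)."""
    with urlopen(urljoin(base, path)) as response:
        return json.load(response)


def iter_edges(fetch, first_path):
    """Yield edges from every page, following nextPage until it is absent."""
    path = first_path
    while path:
        page = fetch(path)
        yield from page.get("edges", [])
        # Assumption: the link to the next page lives under view.nextPage.
        path = page.get("view", {}).get("nextPage")


def first_letter(label):
    """Return the first character of a label, ignoring a leading 'a ' or 'an '."""
    text = label.lower()
    for prefix in ("a ", "an "):
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    return text[:1]


def tally_first_letters(edges):
    """Count first letters over the unique start and end concepts."""
    concepts = set()
    for edge in edges:
        concepts.add(edge["start"]["label"])
        concepts.add(edge["end"]["label"])
    return Counter(first_letter(c) for c in concepts if c)
```

Calling `tally_first_letters(iter_edges(fetch_page, "/r/PartOf"))` would then walk every page the API is willing to return and count the letters.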


rspeer · 2018-05-07T15:03:13Z

Good observation! That looks a lot like paging stopping early. In particular, the database query stops early to avoid slowdowns when there are too many results.

Something we're working on that uses machine learning to refine the "weight" value would cause the results to be ranked by something more meaningful than the order they were put in the database, although it would be biased toward certain data sources.

If you need to get all the edges for a common relation, I'd recommend downloading the data and filtering it.
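
Filtering the downloaded data could look roughly like the sketch below. It assumes the dump is a gzipped, tab-separated text file with five fields per line (edge URI, relation, start node, end node, JSON metadata); check that against the file you actually download. The function names and the file path in the usage note are hypothetical.

```python
import gzip
import json


def edges_for_relation(lines, relation):
    """Yield (start, end, info) for each line whose relation field matches."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip lines that do not match the assumed layout
        _uri, rel, start, end, info = fields
        if rel == relation:
            yield start, end, json.loads(info)


def edges_from_dump(path, relation):
    """Stream matching edges from a gzipped dump without loading it all."""
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        yield from edges_for_relation(dump, relation)
```

For example, `for start, end, info in edges_from_dump("assertions.csv.gz", "/r/PartOf"): ...` would iterate over every PartOf edge in the file, with none of the early cutoff that the API applies.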

jimkang · 2018-05-07T16:12:03Z

Aha! Thank you, it didn't occur to me that downloading it all was an option!

What I'm trying to accomplish is getting a random edge for a relation. Right now, I think I'll download, convert to ndjson so there's one edge per line, then pick a random line from the file. I don't suppose there's a more direct way to do that?
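
One way to pick the random line without counting the lines first is reservoir sampling with a reservoir of size one. This is a swapped-in standard technique, not something ConceptNet provides; `random_line` is a hypothetical helper, and it works on any iterable of lines, such as an open ndjson file.

```python
import random


def random_line(lines, rng=random):
    """Return one uniformly random item from an iterable in a single pass."""
    chosen = None
    for count, line in enumerate(lines, start=1):
        # Replace the current choice with probability 1/count, which leaves
        # every line seen so far equally likely to be the final pick.
        if rng.randrange(count) == 0:
            chosen = line
    return chosen
```

Because it reads the file once and keeps only one line in memory, this also works directly on a stream of filtered edges, so the intermediate file is optional.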

I see that there's a method for that in the DB wrapper, but it does not appear to be exposed?

rspeer · 2019-04-16T17:17:18Z

One reason we didn't expose the query for random edges is that the ConceptNet server relies on a lot of caching to stay up, and randomness can't be cached.

The bias toward alphabetically-earlier concepts is fixed in 5.7.

rspeer closed this as completed Apr 16, 2019
