full code review #44

Closed
wants to merge 361 commits

Conversation

Member

xchrdw commented Mar 27, 2014

No description provided.

felixniemeyer and others added some commits Jan 23, 2014

@felixniemeyer felixniemeyer Merge branch 'modify_entity_selector_from_extension' of github.com:Wi…
…kidata-lib/PropertySuggester into SuggestionsByPrefix
d05fb37
@felixniemeyer felixniemeyer getting 'more' button right & refactoring 64be709
@xchrdw xchrdw fix more makes 2 requests on empty input 7fcfbc8
@felixniemeyer felixniemeyer adapting GetSuggestions to coding style conventions 8f74053
@xchrdw xchrdw Merge branch 'SuggestionsByPrefix' of https://github.com/Wikidata-lib…
…/property-suggester into modify_entity_selector_from_extension

fix whitespace and some code improvements
42141f6
@xchrdw xchrdw non case sensitive search, add possibility to deactivate suggester wi…
…th paramter "nosuggestions"
6187a89
@xchrdw xchrdw search aliases 6b77849
@xchrdw xchrdw don't add aliases when there is no alias b89004e
@xchrdw xchrdw bla 49d12c0
@Dacry Dacry Merge pull request #24 from Wikidata-lib/modify_entity_selector_from_…
…extension

Modify entity selector from extension to show suggestions
261a022
@xchrdw xchrdw remove hack for empty entity 3d7222c
@xchrdw xchrdw remove "compressed" csv format ce1b042
@xchrdw xchrdw Update SimplePHPSuggester.php c71f301
@felixniemeyer felixniemeyer Merge pull request #25 from Wikidata-lib/remove_hack_for_empty_entity
remove hack for empty entity
fbbedf4
@felixniemeyer felixniemeyer Merge pull request #26 from Wikidata-lib/cleanup_parser
remove "compressed" csv format
50fadf7
@felixniemeyer felixniemeyer implementing first (but useless) test 2e2fad8
@felixniemeyer felixniemeyer now really e812527
@felixniemeyer felixniemeyer adding hook 7557a9a
@Dacry Dacry Access to Property Information changed
Small fix: Only show aliases if available
1e935b2
@Dacry Dacry tabs... a9ab52a
@xchrdw xchrdw use a claim object instead of tupels. 687bef0
@xchrdw xchrdw add api wrapper for suggestion api ab6143d
@xchrdw xchrdw add entity class f6da549
@xchrdw xchrdw move entity and claim to datatypes.py 6ff6c06
@xchrdw xchrdw fix coding style in suggester bf99319
@xchrdw xchrdw Merge pull request #29 from Wikidata-lib/use_claim_object_instead_of_…
…tupels

Use entity and claim objects instead of tupels
da0c355
@xchrdw xchrdw more fixes 746b53f
@xchrdw xchrdw Merge branch 'master' of https://github.com/Wikidata-lib/property-sug…
…gester into apply_coding_guidelines
d80ccb0
@xchrdw xchrdw only count distinct properties ed4fc54
@xchrdw xchrdw fix whitespace 7e6eda8
@xchrdw xchrdw small fix 7c17f5d
@Dacry Dacry -fix CsvReader to have Enity opjects in generator db64887
@Dacry Dacry Merge pull request #30 from Wikidata-lib/apply_coding_guidelines
Apply coding guidelines
ceb94b8
@xchrdw xchrdw use autoload instead of file includes afead1a
@xchrdw xchrdw Merge branch 'more_coding_style' of https://github.com/Wikidata-lib/p…
…roperty-suggester into phpUnitTests

Conflicts:
	.gitignore
2bce0f3
@xchrdw xchrdw fix use b3aa74f
@xchrdw xchrdw remove unused autoload 331f7ea
@xchrdw xchrdw Merge pull request #31 from Wikidata-lib/more_coding_style
use autoload instead of file includes
d5bf17c
@xchrdw xchrdw Merge branch 'master' of https://github.com/Wikidata-lib/property-sug…
…gester into Adjustments

Conflicts:
	GetSuggestions.php
3164c52
@xchrdw xchrdw fix merge c5f207b
@xchrdw xchrdw Merge pull request #27 from Wikidata-lib/Adjustments
Adjustments
8784e8a
@xchrdw xchrdw add schema update 73a397f
@xchrdw xchrdw add schema, first test c109b0c
@xchrdw xchrdw use namespaces 65f5c15
@xchrdw xchrdw change python package structure 9859132
@xchrdw xchrdw move python code to different repo 149c308
@xchrdw xchrdw remove python 9bdb191
@xchrdw xchrdw Merge branch 'master' of https://github.com/Wikidata-lib/property-sug…
…gester into phpUnitTests

Conflicts:
	PropertySuggester.php
0561d17
@xchrdw xchrdw Merge pull request #32 from Wikidata-lib/phpUnitTests
Php unit tests
81f65c3
@Dacry Dacry added testSuggestByItem...
bff8f71
@Dacry Dacry Merge pull request #34 from Wikidata-lib/testing
added testSuggestByItem...
0906be9
@Dacry Dacry testDeprecated Properties... fc52c25
@xchrdw xchrdw add travis config (copy from wikibase) 32b2d00
@xchrdw xchrdw update config 3a1be3e
@Dacry Dacry FixDBUsageForTests f84c205
@xchrdw xchrdw Merge branch 'FixDBUsageForTests' of https://github.com/Wikidata-lib/…
…property-suggester into travis_config
579290b
@xchrdw xchrdw Merge pull request #36 from Wikidata-lib/FixDBUsageForTests
FixDBUsageForTests
1596e62
@Dacry Dacry GetSuggestionsHelper 617090f
@xchrdw xchrdw fix travis config 1bc3caa
@xchrdw xchrdw use php 5.3 and add sqlite for travis 0cd8b66
@felixniemeyer felixniemeyer add maintenance script to load property pair occurrence probability f…
…rom csv file
f24c87b
@xchrdw xchrdw Merge pull request #38 from Wikidata-lib/travis_config_merged
Travis config
e78d4c1
@felixniemeyer felixniemeyer adding tests for maintenance table updater f5333ac
@felixniemeyer felixniemeyer Merge branch 'master' of github.com:Wikidata-lib/PropertySuggester in…
…to add_management_script
1b91dba
@felixniemeyer felixniemeyer implementing tests for maintanance csv data load 9e65088
@felixniemeyer felixniemeyer resolving function()[] issue and reducing to one testmethod c4b62f4
@felixniemeyer felixniemeyer resolving the other conflict too lol f0e4846
@felixniemeyer felixniemeyer renaming a php file 5fa5aa7
@xchrdw xchrdw Merge branch 'master' of https://github.com/Wikidata-lib/property-sug…
…gester into GetSuggestionsHelper

Conflicts:
	src/PropertySuggester/GetSuggestions.php
9ed484b
@xchrdw xchrdw add helper test 1802709
@xchrdw xchrdw improve term matching performance 4cdb418
@Dacry Dacry fix resultSize comparison in loop
move mergeWithTraditionalSearchResults to helper

test mergeWithTraditionalSearchResults
f86d5b0
@Dacry Dacry test generateSuggestions d6909fa
@xchrdw xchrdw test improved 3de5a6d
@xchrdw xchrdw fix some formatting dffb0a3
@xchrdw xchrdw Merge pull request #40 from Wikidata-lib/GetSuggestionsHelper
Get suggestions helper
846a5fc
@xchrdw xchrdw add composer c4ca8fc
@xchrdw xchrdw Merge pull request #41 from Wikidata-lib/add_composer
add composer
c2db9f1
@xchrdw xchrdw add readme 19a899a
@Dacry Dacry Merge pull request #42 from Wikidata-lib/add_readme
add readme
b52b6c9
@xchrdw xchrdw Merge branch 'master' of github.com:Wikidata-lib/property-suggester i…
…nto add_management_script

Conflicts:
	PropertySuggester.php
ff69a87
@xchrdw xchrdw Merge pull request #39 from Wikidata-lib/add_management_script
add maintenance script to load probability table from csv to db
4e00b57
@xchrdw xchrdw fix for composer f9b2d37
@xchrdw xchrdw Merge pull request #43 from Wikidata-lib/define_globals
fix for composer
6ba8f7e
@xchrdw xchrdw Merge branch 'deploy' 5f30d95
@xchrdw xchrdw fix suggester is used without item 51c28f4
@xchrdw xchrdw Merge pull request #46 from Wikidata-lib/fix_suggester_js
fix suggester is used without item
90e3170
@xchrdw xchrdw fix csv import path bd846be
@xchrdw xchrdw fix test 2e1b8e7
@xchrdw xchrdw Merge pull request #47 from Wikidata-lib/fix_management_path
Fix management path
664022d

@addshore addshore commented on an outdated diff Apr 2, 2014

tests/phpunit/maintenance/updateTableTest.php
@@ -0,0 +1,74 @@
+<?php
+
+use PropertySuggester\Maintenance\UpdateTable;
+
+use Wikibase\DataModel\Entity\PropertyId;
@addshore

addshore Apr 2, 2014

Contributor

none of these uses are needed!

@addshore addshore commented on an outdated diff Apr 2, 2014

maintenance/UpdateTable.php
+ * Maintenance script to load probability table from given csv file
+ *
+ * @ingroup Maintenance
+ */
+class UpdateTable extends Maintenance {
+ function __construct() {
+ parent::__construct();
+ $this->mDescription = "Read CSV Dump and refill probability table";
+ $this->addOption( 'file', 'CSV table to be loaded (relative path)', true, true );
+ $this->addOption( 'use-insert', 'Avoid DBS specific import. Use INSERTs.', false, false );
+ $this->addOption( 'silent', 'Do not show information', false, false );
+ }
+
+
+ function execute() {
+ global $wgVersion, $wgTitle, $wgLang;
@addshore

addshore Apr 2, 2014

Contributor

none of these (3) globals are used here

@addshore addshore and 2 others commented on an outdated diff Apr 2, 2014

PropertySuggester.php
@@ -0,0 +1,69 @@
+<?php
+/**
+ * PropertySuggester extension.
+ * License: WTFPL 2.0
+ */
+
+if ( defined( 'PropertySuggester_VERSION' ) ) {
+ // Do not initialize more then once.
+ return;
+}
+
+define( 'PropertySuggester_VERSION', '0.9' );
+
+global $wgExtensionCredits;
@addshore

addshore Apr 2, 2014

Contributor

This is the entry file and is in the global scope, so the "global $wgExtensionCredits;" above is not needed (many of the other global declarations here are not needed either).
Of course these would be needed if you used call_user_func() like we do at https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/client/WikibaseClient.php#L45

@xchrdw

xchrdw Apr 2, 2014

Member

These globals are necessary if the extension is installed via Composer. Should we wrap it in a function, as is done in Wikibase?

@brightbyte

brightbyte Apr 3, 2014

Contributor

Personally, I don't care much.

@snaterlicious snaterlicious commented on an outdated diff Apr 2, 2014

PropertySuggester.php
+global $wgSpecialPages;
+$wgSpecialPages['PropertySuggester'] = 'PropertySuggester\SpecialSuggester';
+
+global $wgSpecialPagesGroups;
+$wgSpecialPageGroups['PropertySuggester'] = 'wikibaserepo';
+
+global $wgAPIModules;
+$wgAPIModules['wbsgetsuggestions'] = 'PropertySuggester\GetSuggestions';
+
+global $wgHooks;
+$wgHooks['BeforePageDisplay'][] = 'PropertySuggesterHooks::onBeforePageDisplay';
+$wgHooks['UnitTestsList'][] = 'PropertySuggesterHooks::onUnitTestsList';
+$wgHooks['LoadExtensionSchemaUpdates'][] = 'PropertySuggesterHooks::onCreateSchema';
+
+global $wgResourceModules;
+$wgResourceModules['ext.PropertySuggester'] = array(
@snaterlicious

snaterlicious Apr 2, 2014

You could slim the individual module definitions down a bit using a module template:

    $moduleTemplate = array(
        'localBasePath' => __DIR__,
        'remoteExtPath' => 'PropertySuggester',
    );

... and then apply it to each module:

    $wgResourceModules['ext.PropertySuggester'] = $moduleTemplate + array( ... );

In addition, you can drop key-value pairs that only contain an empty array.

@snaterlicious snaterlicious commented on an outdated diff Apr 2, 2014

PropertySuggester.php
+$wgSpecialPageGroups['PropertySuggester'] = 'wikibaserepo';
+
+global $wgAPIModules;
+$wgAPIModules['wbsgetsuggestions'] = 'PropertySuggester\GetSuggestions';
+
+global $wgHooks;
+$wgHooks['BeforePageDisplay'][] = 'PropertySuggesterHooks::onBeforePageDisplay';
+$wgHooks['UnitTestsList'][] = 'PropertySuggesterHooks::onUnitTestsList';
+$wgHooks['LoadExtensionSchemaUpdates'][] = 'PropertySuggesterHooks::onCreateSchema';
+
+global $wgResourceModules;
+$wgResourceModules['ext.PropertySuggester'] = array(
+ 'scripts' => array( 'modules/ext.PropertySuggester.js' ),
+ 'styles' => 'modules/ext.PropertySuggester.css',
+ 'messages' => array(),
+ 'dependencies' => array( 'ext.PropertySuggester.EntitySelector' ),
@snaterlicious

snaterlicious Apr 2, 2014

In Wikibase, we switched to explicitly specifying each dependency, removing the implicit reliance on the dependency chain. It is recommended to do the same here, adding dependencies on jQuery itself, the suggester widget and the module providing mediaWiki.config (if that is supposed to remain). The same applies to the second ResourceLoader module.
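For illustration, a dependency list along those lines could look roughly like this (a sketch only; apart from 'ext.PropertySuggester.EntitySelector', the exact module names to depend on are assumptions and would need to be checked against the registered ResourceLoader modules):

    $wgResourceModules['ext.PropertySuggester'] = $moduleTemplate + array(
        'scripts' => array( 'modules/ext.PropertySuggester.js' ),
        'styles' => 'modules/ext.PropertySuggester.css',
        'dependencies' => array(
            // every module whose API is used directly, instead of relying on it
            // being pulled in transitively via the entity selector
            'jquery',
            'mediawiki',
            'ext.PropertySuggester.EntitySelector',
        ),
    );

The second ResourceLoader module would get the same treatment.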

@snaterlicious snaterlicious commented on an outdated diff Apr 2, 2014

modules/ext.PropertySuggester.EntitySelector.js
@@ -0,0 +1,67 @@
+
+$.widget( 'wikibase.entityselector', $.wikibase.entityselector, {
+ /**
+ * override usual entityselector and replace _request and _create
+ * if a property is requested and we are on an Entity page.
+ *
+ * @see ui.suggester._request
+ */
@snaterlicious

snaterlicious Apr 2, 2014

Please apply function documentation properly.

@snaterlicious snaterlicious commented on an outdated diff Apr 2, 2014

modules/ext.PropertySuggester.EntitySelector.js
+ * if a property is requested and we are on an Entity page.
+ *
+ * @see ui.suggester._request
+ */
+
+ _oldCreate: $.wikibase.entityselector.prototype._create,
+
+ _create: function() {
+ var self = this;
+
+ self._oldCreate.apply(self, arguments);
+
+ if ( self.__useSuggester() ) {
+ var inputHandler = function() {
+ if ( self.value() === '' ) {
+ self.search( '*' );
@snaterlicious

snaterlicious Apr 2, 2014

I wonder whether the initial search in fact is a responsibility of the property suggester. I would tend to remove the _create() method's logic altogether and attach the event listeners in ext.PropertySuggester.js directly to the input element.

@snaterlicious snaterlicious commented on an outdated diff Apr 2, 2014

modules/ext.PropertySuggester.EntitySelector.js
+ */
+
+ _oldCreate: $.wikibase.entityselector.prototype._create,
+
+ _create: function() {
+ var self = this;
+
+ self._oldCreate.apply(self, arguments);
+
+ if ( self.__useSuggester() ) {
+ var inputHandler = function() {
+ if ( self.value() === '' ) {
+ self.search( '*' );
+ }
+ };
+ self.element.bind('input', inputHandler);
@snaterlicious

snaterlicious Apr 2, 2014

.on() should be preferred to .bind(). The input event has some implementation flaws in some browsers; you might want to use the eachchange event the entityselector itself is using. The event is provided by the ValueView extension (jquery.event.special.eachchange ResourceLoader module).
In addition, you should assign a namespace to the event, so that it gets detached automatically when the widget is destroyed:

    self.element.on( 'eachchange.' + this.widgetName, function() { ... } );

@snaterlicious snaterlicious commented on an outdated diff Apr 2, 2014

modules/ext.PropertySuggester.EntitySelector.js
+ self._oldCreate.apply(self, arguments);
+
+ if ( self.__useSuggester() ) {
+ var inputHandler = function() {
+ if ( self.value() === '' ) {
+ self.search( '*' );
+ }
+ };
+ self.element.bind('input', inputHandler);
+
+ var focusHandler = function() {
+ if ( self.value() === '' && !self.menu.element.is( ':visible' ) ) {
+ self.search( '*' );
+ }
+ };
+ self.element.focus(focusHandler);
@snaterlicious

snaterlicious Apr 2, 2014

For consistency, you should always prefer .on() to its shortcut methods.

@snaterlicious snaterlicious commented on an outdated diff Apr 2, 2014

modules/ext.PropertySuggester.EntitySelector.js
+ if ( self.value() === '' ) {
+ self.search( '*' );
+ }
+ };
+ self.element.bind('input', inputHandler);
+
+ var focusHandler = function() {
+ if ( self.value() === '' && !self.menu.element.is( ':visible' ) ) {
+ self.search( '*' );
+ }
+ };
+ self.element.focus(focusHandler);
+ }
+ },
+
+ _oldRequest: $.wikibase.entityselector.prototype._request,
@snaterlicious

snaterlicious Apr 2, 2014

As discussed before, this is kind of awkward, but it results from the entity selector's lack of flexibility. Still, there might be a cleaner way: $.wikibase.entityselector inherits from $.ui.suggester, and $.ui.suggester allows specifying a source option which maps to _request() (see $.ui.suggester._create()). Basically, you would just pass your _request() method as the source option to the entity selector and, finally, you might even get around overwriting the entity selector altogether.

@invliD invliD closed this Apr 30, 2014

@xchrdw xchrdw reopened this Apr 30, 2014

@nik9000 nik9000 and 1 other commented on an outdated diff May 7, 2014

maintenance/UpdateTable.php
+ * loads property pair occurrence probability table from given csv file
+ */
+ function execute() {
+ if ( substr( $this->getOption( 'file' ), 0, 2 ) === "--" ) {
+ $this->error( "The --file option requires a file as an argument.\n", true );
+ }
+ $path = $this->getOption( 'file' );
+ $fullPath = realpath( $path );
+ $fullPath = str_replace( '\\', '/', $fullPath );
+
+ if ( !file_exists( $fullPath ) ) {
+ $this->error( "Cant find $path \n", true );
+ }
+
+ $useInsert = $this->getOption( 'use-insert' );
+ $showInfo = !$this->getOption( 'silent' );
@nik9000

nik9000 May 7, 2014

--quiet is part of all maintenance scripts by default and suppresses $this->output from printing anything. Maybe just use that?
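A rough sketch of what dropping the custom option in favour of the built-in flag could look like (only the relevant lines of the maintenance script are shown):

    class UpdateTable extends Maintenance {

        function __construct() {
            parent::__construct();
            $this->mDescription = "Read CSV Dump and refill probability table";
            $this->addOption( 'file', 'CSV table to be loaded (relative path)', true, true );
            $this->addOption( 'use-insert', 'Avoid DBS specific import. Use INSERTs.', false, false );
            // no custom 'silent' option: --quiet is available in every
            // maintenance script and silences Maintenance::output()
        }

        function execute() {
            // ...
            $this->output( "loading new entries from file\n" ); // suppressed by --quiet
            // ...
        }
    }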

@xchrdw

xchrdw May 7, 2014

Member

sounds good, thanks

@nik9000 nik9000 commented on an outdated diff May 7, 2014

maintenance/UpdateTable.php
+ $this->error( "The --file option requires a file as an argument.\n", true );
+ }
+ $path = $this->getOption( 'file' );
+ $fullPath = realpath( $path );
+ $fullPath = str_replace( '\\', '/', $fullPath );
+
+ if ( !file_exists( $fullPath ) ) {
+ $this->error( "Cant find $path \n", true );
+ }
+
+ $useInsert = $this->getOption( 'use-insert' );
+ $showInfo = !$this->getOption( 'silent' );
+ $tableName = 'wbs_propertypairs';
+
+ wfWaitForSlaves( 5 );
+ $lb = wfGetLB( DB_MASTER );
@nik9000

nik9000 May 7, 2014

My documentation says the parameter for this should be the wiki name or false or left out entirely. Am I out of date?

nik9000 commented May 7, 2014

I'm wondering about transactionality, how long this takes, and slave lag. I'm not really familiar with mediawiki's transaction handling so I might pass this off to someone else for a review there. I'm reasonably sure that blasting and rebuilding a large table from scratch would cause trouble with slave lag but I don't know how large you have to be to trigger it and I don't know how large the table is.

@nik9000 nik9000 commented on an outdated diff May 7, 2014

...pertySuggester/UpdateTable/Importer/MySQLImporter.php
+
+ /**
+ * @param ImportContext $importContext
+ * @return bool
+ */
+ function importFromCsvFileToDb( ImportContext $importContext ) {
+
+ $lb = $importContext->getLb();
+ $db = $lb->getConnection( DB_MASTER );
+
+ $dbTableName = $db->tableName( $importContext->getTargetTableName() );
+ $fullPath = $db->addQuotes( $importContext->getCsvFilePath() );
+ $delimiter = $db->addQuotes( $importContext->getCsvDelimiter() );
+
+ $db->query("
+ LOAD DATA INFILE $fullPath
@nik9000

nik9000 May 7, 2014

Might be best to document that this only works if you run the script on the same server as the MySQL server or you've somehow sent the server the file. I don't know if WMF will allow either one.

@nik9000

nik9000 May 8, 2014

I don't think this command will work well with WMF because we won't have the files on disk and it'll bulk load too many at a time.

@nik9000 nik9000 and 1 other commented on an outdated diff May 7, 2014

sql/create_propertypairs.sql
@@ -0,0 +1,17 @@
+
+CREATE TABLE IF NOT EXISTS /*_*/wbs_propertypairs (
+ row_id INT unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
@nik9000

nik9000 May 7, 2014

How many of these are you going to be inserting each run? This'll start to fail after four billion or so inserts. Not that the table has to be four billion records long, just four billion inserts. If we do ten million a night this'll start crashing after a year and two months or so.

@xchrdw

xchrdw May 7, 2014

Member

There are currently about 60,000 rows in the table.

@xchrdw

xchrdw May 7, 2014

Member

would you recommend a 64-bit key or, alternatively, a combined key over (pid1, qid1, pid2, context)?

@nik9000

nik9000 May 7, 2014

Personally I'd go with a 64-bit key and forget about it. I remember something about MySQL creating a hidden primary key column if a single column wasn't already the primary key. I'll ask our DBA.

@nik9000

nik9000 May 8, 2014

I got an email back from our dba last night and the INT unsigned ought to be fine BUT the load mechanism probably isn't.

@nik9000 nik9000 commented on an outdated diff May 8, 2014

maintenance/UpdateTable.php
+ }
+ $path = $this->getOption( 'file' );
+ $fullPath = realpath( $path );
+ $fullPath = str_replace( '\\', '/', $fullPath );
+
+ if ( !file_exists( $fullPath ) ) {
+ $this->error( "Cant find $path \n", true );
+ }
+
+ $useInsert = $this->getOption( 'use-insert' );
+ $tableName = 'wbs_propertypairs';
+
+ wfWaitForSlaves( 5 );
+ $lb = wfGetLB();
+
+ $this->clearTable( $lb, $tableName );
@nik9000

nik9000 May 8, 2014

So the whole blast-the-contents-of-the-table-then-rebuild approach isn't great for us. If you do it outside of a transaction, the whole thing blips during the rebuild process, and if you do it inside a transaction and you have more than a couple thousand rows, it makes replication angry.

@nik9000

nik9000 May 8, 2014

Maybe, instead, you can do an upsert for all the records? It's a MySQL-ism, but it'd work. PostgreSQL has an equivalent construction, but it probably isn't worth customizing for them. Just do a query, check what has to change, then send the updates, I guess.
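A minimal sketch of a per-row upsert variant, assuming a unique index over (pid1, qid1, pid2, context) (which the current schema does not define) and that the target MediaWiki version provides DatabaseBase::upsert(); as the May 10 reply below notes, the team ultimately decided against this approach:

    foreach ( $accumulator as $row ) {
        $db->upsert(
            $importContext->getTargetTableName(),
            array( $row ),                                        // insert if missing
            array( array( 'pid1', 'qid1', 'pid2', 'context' ) ),  // unique key columns
            array(                                                // update if present
                'count' => $row['count'],
                'probability' => $row['probability'],
            ),
            __METHOD__
        );
    }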

@nik9000 nik9000 commented on an outdated diff May 8, 2014

maintenance/UpdateTable.php
+ if ( !file_exists( $fullPath ) ) {
+ $this->error( "Cant find $path \n", true );
+ }
+
+ $useInsert = $this->getOption( 'use-insert' );
+ $tableName = 'wbs_propertypairs';
+
+ wfWaitForSlaves( 5 );
+ $lb = wfGetLB();
+
+ $this->clearTable( $lb, $tableName );
+
+ $this->output( "loading new entries from file\n" );
+
+ $importContext = $this->createImportContext( $lb, $tableName, $fullPath );
+ $insertionStrategy = $this->createImportStrategy( $useInsert );
@nik9000

nik9000 May 8, 2014

Might want to rename importStrategy

@nik9000 nik9000 and 1 other commented on an outdated diff May 8, 2014

...pertySuggester/UpdateTable/Importer/BasicImporter.php
+ * @param DatabaseBase $db
+ * @param ImportContext $importContext
+ * @throws UnexpectedValueException
+ */
+ private function doImport( $fileHandle, DatabaseBase $db, ImportContext $importContext ) {
+ $accumulator = array();
+ $i = 0;
+ $header = fgetcsv( $fileHandle, 0, $importContext->getCsvDelimiter() ); //this is to get the csv-header
+ $expectedHeader = array( 'pid1', 'qid1', 'pid2', 'count', 'probability', 'context' );
+ if( $header != $expectedHeader ) {
+ throw new UnexpectedValueException( "provided csv-file does not match the expected format:\n" . join( ',', $expectedHeader ) );
+ }
+ while ( true ) {
+ $data = fgetcsv( $fileHandle, 0, $importContext->getCsvDelimiter() );
+
+ if ( $data == false || ++$i > 1000 ) {
@nik9000

nik9000 May 8, 2014

Make the 1000 configurable via a batch-size parameter. I'm pretty sure that, like the quiet parameter, it is built into maintenance scripts.
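A sketch of what that could look like, leaning on the batch size handling Maintenance already ships with (the ImportContext wiring mirrors the code under review; the default value is an assumption):

    class UpdateTable extends Maintenance {

        function __construct() {
            parent::__construct();
            // registers the standard --batch-size option and sets the default
            $this->setBatchSize( 1000 );
        }

        function execute() {
            // ... build the import context as before, but take the flush size
            // from the (possibly overridden) batch size instead of a literal
            $importContext = $this->createImportContext( $lb, $tableName, $fullPath );
            $importContext->setBatchSize( $this->mBatchSize );
            // ...
        }
    }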

@JeroenDeDauw

JeroenDeDauw May 8, 2014

Contributor

At the very least, this should not appear as a magic number in the middle of the code. If it is left hardcoded, putting it in a named constant would be better.

@nik9000

nik9000 May 8, 2014

To be clear, making this batch size come from a parameter to the script is a deployment requirement. So says the DBA.


@nik9000 nik9000 commented on an outdated diff May 8, 2014

...pertySuggester/UpdateTable/Importer/BasicImporter.php
+ * @param ImportContext $importContext
+ * @throws UnexpectedValueException
+ */
+ private function doImport( $fileHandle, DatabaseBase $db, ImportContext $importContext ) {
+ $accumulator = array();
+ $i = 0;
+ $header = fgetcsv( $fileHandle, 0, $importContext->getCsvDelimiter() ); //this is to get the csv-header
+ $expectedHeader = array( 'pid1', 'qid1', 'pid2', 'count', 'probability', 'context' );
+ if( $header != $expectedHeader ) {
+ throw new UnexpectedValueException( "provided csv-file does not match the expected format:\n" . join( ',', $expectedHeader ) );
+ }
+ while ( true ) {
+ $data = fgetcsv( $fileHandle, 0, $importContext->getCsvDelimiter() );
+
+ if ( $data == false || ++$i > 1000 ) {
+ $db->insert( $importContext->getTargetTableName(), $accumulator );
@nik9000

nik9000 May 8, 2014

Probably a good idea to spit out an output line every time you do a batch. And, yeah, like I said earlier, it'll have to be upserts or query, pick rows to change, then update. And insert, I guess.

@nik9000

nik9000 May 8, 2014

Oh! Also, I'm told this is where you are supposed to wait for slave lag. The idea is that inserting a batch can push the slave lag up so you should wait for it to go back down before adding more.
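Roughly, the import loop with per-batch output and slave waiting could look like this (a sketch only; the helper names follow the surrounding diff, and the column-to-field mapping is an assumption):

    $accumulator = array();
    $batchSize = $importContext->getBatchSize();
    $i = 0;
    while ( true ) {
        $data = fgetcsv( $fileHandle, 0, $importContext->getCsvDelimiter() );

        if ( $data === false || ++$i % $batchSize === 0 ) {
            $db->insert( $importContext->getTargetTableName(), $accumulator, __METHOD__ );
            if ( !$importContext->isQuiet() ) {
                print "$i rows inserted\n";
            }
            // flush the transaction and let replication catch up before the next batch
            $db->commit( __METHOD__, 'flush' );
            wfWaitForSlaves();
            $accumulator = array();
            if ( $data === false ) {
                break;
            }
        }
        $accumulator[] = array(
            'pid1' => $data[0], 'qid1' => $data[1], 'pid2' => $data[2],
            'count' => $data[3], 'probability' => $data[4], 'context' => $data[5],
        );
    }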

@nik9000 nik9000 commented on the diff May 8, 2014

sql/create_propertypairs.sql
@@ -0,0 +1,17 @@
+
+CREATE TABLE IF NOT EXISTS /*_*/wbs_propertypairs (
+ row_id INT unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
+ pid1 INT unsigned NOT NULL,
+ qid1 INT unsigned NULL,
+ pid2 INT unsigned NOT NULL,
+ count INT unsigned NOT NULL,
+ probability FLOAT NOT NULL,
+ context VARBINARY(32) NOT NULL
@nik9000

nik9000 May 8, 2014

Probably a nit pick, but it'd be smaller to store these as small integers or something. 60k probably isn't enough rows for that to matter.

@nik9000 nik9000 commented on an outdated diff May 8, 2014

sql/create_propertypairs.sql
@@ -0,0 +1,17 @@
+
+CREATE TABLE IF NOT EXISTS /*_*/wbs_propertypairs (
+ row_id INT unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
+ pid1 INT unsigned NOT NULL,
+ qid1 INT unsigned NULL,
+ pid2 INT unsigned NOT NULL,
+ count INT unsigned NOT NULL,
+ probability FLOAT NOT NULL,
+ context VARBINARY(32) NOT NULL
+) /*$wgDBTableOptions*/;
+
+
+CREATE INDEX /*i*/propertypairs_pid1_pid2_qid1 ON /*_*/wbs_propertypairs (pid1, qid1, pid2, context);
+
+CREATE INDEX /*i*/propertypairs_pid1_pid2 ON /*_*/wbs_propertypairs (pid1, pid2, context);
@nik9000

nik9000 May 8, 2014

I can't find where this or the next index is used. I know that's kind of a difficult thing to check in a code review, but it looks like only the (pid1, qid1, pid2, context) index is used. My DBA kept talking about the number of indexes being a big deal to keep replication happy.

Member

xchrdw commented May 10, 2014

Thanks for your feedback. We hope to have fixed all problems by now. We decided not to use upserts as they aren't part of the SQL standard and we would have to implement them differently for SQLite or other databases that should be supported. We think that shouldn't be a problem since there are only about 60,000 rows in the table. There is still the LOAD DATA INFILE statement, but you can switch to normal inserts with --use-insert. We added a bigint key and a batch size, and took care of the possible slave lag. Please tell us if there is some blocking issue left for us to fix.

nik9000 commented May 12, 2014

Thanks for your feedback. We hope to have fixed all problems by now. We decided not to use upserts as they aren't part of the SQL standard and we would have to implement them differently for SQLite or other databases that should be supported.

Probably a good call.

There is still the LOAD DATA INFILE statement, but you can switch to normal inserts with --use-insert.

Cool.

We added a bigint key and a batch size, and took care of the possible slave lag. Please tell us if there is some blocking issue left for us to fix.

Will review again. Man, sometimes I wish GitHub were Gerrit and I could look at a squashed view of what you changed since the last time I looked. Maybe there is a way; my GitHub-fu isn't strong.

@nik9000 nik9000 commented on an outdated diff May 12, 2014

...pertySuggester/UpdateTable/Importer/BasicImporter.php
+ $batchSize = $importContext->getBatchSize();
+ $i = 0;
+ $header = fgetcsv( $fileHandle, 0, $importContext->getCsvDelimiter() ); //this is to get the csv-header
+ $expectedHeader = array( 'pid1', 'qid1', 'pid2', 'count', 'probability', 'context' );
+ if( $header != $expectedHeader ) {
+ throw new UnexpectedValueException( "provided csv-file does not match the expected format:\n" . join( ',', $expectedHeader ) );
+ }
+ while ( true ) {
+ $data = fgetcsv( $fileHandle, 0, $importContext->getCsvDelimiter() );
+
+ if ( $data == false || ++$i % $batchSize == 0 ) {
+ $db->commit( __METHOD__, 'flush' );
+ wfWaitForSlaves();
+ $db->insert( $importContext->getTargetTableName(), $accumulator );
+ if ( ! $importContext->isQuiet() ) {
+ print "$i rows inserted";
@nik9000

nik9000 May 12, 2014

Might be simpler to have importContext have an output method that forwards to the Maintenance's output. No big deal either way. No need to change it now.

@nik9000 nik9000 and 1 other commented on an outdated diff May 12, 2014

maintenance/UpdateTable.php
+ $importContext->setCsvDelimiter( ',' );
+ $importContext->setBatchSize( $this->mBatchSize );
+ $importContext->setQuiet( $quiet );
+
+ return $importContext;
+ }
+
+ /**
+ * @param LoadBalancer $lb
+ * @param string $tableName
+ */
+ private function clearTable( LoadBalancer $lb, $tableName ) {
+ $db = $lb->getConnection( DB_MASTER );
+ if ( $db->tableExists( $tableName ) ) {
+ $this->output( "removing old entries\n" );
+ $db->delete( $tableName, '*' );
@nik9000

nik9000 May 12, 2014

Checking with the DBA whether this needs to be in a while (delete() > 0) {checkForLag} kind of loop.... Man, replication is a pain.

@nik9000

nik9000 May 14, 2014

I haven't yet heard back. Might be best to implement it in such a loop if you have time. I'll let you know once I hear back, but I have a suspicion that it'll have to be batched rather than all at once.
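A sketch of such a loop, selecting one chunk of primary keys at a time and deleting it before waiting for the slaves (this is essentially the shape the later revision of clearTable() takes):

    private function clearTable( LoadBalancer $lb, $tableName, $primaryKey ) {
        $db = $lb->getConnection( DB_MASTER );
        $this->output( "removing old entries\n" );
        while ( true ) {
            // fetch one batch of primary keys ...
            $idChunk = $db->select(
                $tableName,
                array( $primaryKey ),
                '',
                __METHOD__,
                array( 'LIMIT' => $this->mBatchSize )
            );
            $ids = array();
            foreach ( $idChunk as $row ) {
                $ids[] = $row->$primaryKey;
            }
            if ( !$ids ) {
                break; // nothing left to delete
            }
            // ... delete it and give replication time to catch up
            $db->delete( $tableName, array( $primaryKey => $ids ), __METHOD__ );
            $db->commit( __METHOD__, 'flush' );
            wfWaitForSlaves();
        }
        $lb->reuseConnection( $db );
    }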

@Dacry

Dacry May 15, 2014

Contributor

It's been done and should be fine now; if you feel there is anything left to do, please let us know ;)

Member

xchrdw commented May 12, 2014

You can use the compare view but you need to enter the revision hashes manually: 09a79e7...b467815

@nik9000 nik9000 and 1 other commented on an outdated diff May 21, 2014

maintenance/UpdateTable.php
+ /**
+ * @param LoadBalancer $lb
+ * @param string $tableName
+ * @param string $primaryKey
+ */
+ private function clearTable( LoadBalancer $lb, $tableName, $primaryKey ) {
+ $db = $lb->getConnection( DB_MASTER );
+ if ( !$db->tableExists( $tableName ) ) {
+ $this->error( "$tableName table does not exist.\nExecuting core/maintenance/update.php may help.\n", true );
+ }
+ $this->output( "removing old entries\n" );
+ while ( 1 ) {
+ $db->commit( __METHOD__, 'flush' );
+ wfWaitForSlaves();
+
+ $idChunk = $db->select(
@nik9000

nik9000 May 21, 2014

Why not just delete with a limit? I imagine this select then delete is harder on the database. I imagine you have a reason but maybe it'd be good to add it as a comment?

@xchrdw

xchrdw May 21, 2014

Member

SQLite doesn't support DELETE ... LIMIT, but I just realized that SQLite probably won't have slaves :D Maybe it would be best to have one insertion/deletion algorithm for MySQL with limits and one without limits for SQLite.

@nik9000

nik9000 May 21, 2014

I think that is a good idea.


@nik9000 nik9000 commented on the diff May 30, 2014

src/PropertySuggester/Suggesters/SimpleSuggester.php
+ protected function getSuggestions( array $propertyIds, $limit, $minProbability ) {
+ if ( !is_int( $limit ) ) {
+ throw new InvalidArgumentException('$limit must be int!');
+ }
+ if ( !is_float( $minProbability ) ) {
+ throw new InvalidArgumentException('$minProbability must be float!');
+ }
+ if ( !$propertyIds ) {
+ return array();
+ }
+ $excludedIds = array_merge( $propertyIds, $this->deprecatedPropertyIds );
+ $count = count( $propertyIds );
+ $context = 'item'; // only 'item' is supported at the moment
+
+ $dbr = $this->lb->getConnection( DB_SLAVE );
+ $res = $dbr->select(
@nik9000

nik9000 May 30, 2014

How long does this take on production sized data?

@xchrdw

xchrdw Jun 5, 2014

Member

this takes about 60ms for a really big item (Berlin from Wikidata) on my local machine

@nik9000 nik9000 commented on an outdated diff May 30, 2014

src/PropertySuggester/Suggesters/SimpleSuggester.php
+ /**
+ * @param int[] $deprecatedPropertyIds
+ */
+ public function setDeprecatedPropertyIds( array $deprecatedPropertyIds ) {
+ $this->deprecatedPropertyIds = $deprecatedPropertyIds;
+ }
+
+ /**
+ * @param int[] $propertyIds
+ * @param int $limit
+ * @param float $minProbability
+ * @throws InvalidArgumentException
+ * @return Suggestion[]
+ */
+ protected function getSuggestions( array $propertyIds, $limit, $minProbability ) {
+ if ( !is_int( $limit ) ) {
@nik9000

nik9000 May 30, 2014

This should probably have

    $profiler = new ProfileSection( __METHOD__ );

and that should be sprinkled on all the potentially "slow" methods.
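For instance (a sketch only; ProfileSection closes the profiling section automatically when the object goes out of scope at the end of the method):

    protected function getSuggestions( array $propertyIds, $limit, $minProbability ) {
        // profile the DB-heavy part of the suggester
        $profiler = new ProfileSection( __METHOD__ );

        // ... existing argument checks and select() call stay unchanged ...
    }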

@JeroenDeDauw JeroenDeDauw commented on an outdated diff Jun 12, 2014

src/PropertySuggester/SuggestionGenerator.php
+ throw new InvalidArgumentException( 'Item ' . $id . ' could not be found' );
+ }
+ $suggestions = $this->suggester->suggestByItem( $item, $limit, $minProbability );
+ return $suggestions;
+ }
+
+ /**
+ * @param string[] $propertyList - A list of property ids
+ * @param int $limit
+ * @param float $minProbability
+ * @return Suggestion[]
+ */
+ public function generateSuggestionsByPropertyList( array $propertyList, $limit, $minProbability ) {
+ $properties = array();
+ foreach ( $propertyList as $id ) {
+ $properties[] = PropertyId::newFromNumber( $this->getNumericPropertyId( $id ) );
@JeroenDeDauw

JeroenDeDauw Jun 12, 2014

Contributor

Looks like you just want to do new PropertyId( $stringId )

The var names $properties and $propertyList are bad. They contain instances of PropertyId and property id strings, respectively.

Dacry and others added some commits Jun 12, 2014

@Dacry Dacry fix ResultSize + fix naming + strategy in generateSuggestionsByProper…
…tyList
fd78f9a
@Dacry Dacry fix test b130da5
@xchrdw xchrdw only include javascript in item-namespace, batchsize=10000, escape $m…
…inProbability, remove mysqlimporter
f3681f8
@xchrdw xchrdw fix get suggestions with increasing continue value 515d8fa
@xchrdw xchrdw Merge pull request #75 from Wikidata-lib/fix_review_issues_2
fix review issues 2
87bd99b
@xchrdw xchrdw Merge pull request #74 from Wikidata-lib/fixResultSize
fix ResultSize + fix naming + strategy in generateSuggestionsByPropertyList
7ecb320
@xchrdw xchrdw Merge branch 'review' of github.com:Wikidata-lib/property-suggester 269a173
@xchrdw xchrdw remove condition for autoloader, secure $minProbability 4523de9
@xchrdw xchrdw spaces 14d2b25
@xchrdw xchrdw Merge pull request #76 from Wikidata-lib/fix_review_issues_3
remove condition for autoloader, secure $minProbability
4c92e66
@filbertkm filbertkm Fix packagist links in readme 24c94ec
@xchrdw xchrdw Merge pull request #77 from filbertkm/readme
Fix packagist links in readme
0c3af79
@filbertkm filbertkm Check that vendor/autoload.php exists
If we install property suggester via composer,
by including it in the Wikidata "build", in
MediaWiki's composer.json or other such setup
then autoload.php will be elsewhere and handled
elsewhere.
1ff7e8b
@xchrdw xchrdw Merge pull request #78 from filbertkm/autoload
Check that vendor/autoload.php exists
3b3bb81
@filbertkm filbertkm Put PropertySuggester in wikibase group in Special:Version 6bc0b8d
@filbertkm filbertkm Remove reference to wikidata in description
Best to make the description general without
specific mention of Wikidata.
38ce25c
@xchrdw xchrdw Merge pull request #79 from filbertkm/description
Improve extension description
f0f5fb3
@xchrdw xchrdw fix qualifier were counted as mainsnak e643c31
@xchrdw xchrdw Merge pull request #80 from Wikidata-lib/fix_bug
fix qualifier were counted as mainsnak
32ab53d
@filbertkm filbertkm Fix remoteExtPath for installation in non-standard locations
Same hack used in Wikibase, ValueView, etc. to allow this
to work if an extension is installed in a non-standard place.
a75f765
@xchrdw xchrdw Merge pull request #82 from filbertkm/remoteextpath
Fix remoteExtPath for installation in non-standard locations
2fb712f

@AaronSchulz AaronSchulz commented on the diff Jun 26, 2014

maintenance/UpdateTable.php
+$basePath = getenv( 'MW_INSTALL_PATH' ) !== false ? getenv( 'MW_INSTALL_PATH' ) : __DIR__ . '/../../..';
+require_once $basePath . '/maintenance/Maintenance.php';
+
+/**
+ * Maintenance script to load property pair occurrence probability table from given csv file
+ *
+ * @author BP2013N2
+ * @licence GNU GPL v2+
+ */
+class UpdateTable extends Maintenance {
+
+ function __construct() {
+ parent::__construct();
+ $this->mDescription = "Read CSV Dump and refill probability table";
+ $this->addOption( 'file', 'CSV table to be loaded (relative path)', true, true );
+ $this->setBatchSize( 10000 );
@AaronSchulz

AaronSchulz Jun 26, 2014

1000 would be more appropriate. The current value might cause some slave lag due to having too many rows at once.

@xchrdw

xchrdw Jun 26, 2014

Member

We currently use 10000 as suggested here: https://bugzilla.wikimedia.org/show_bug.cgi?id=63224#c5
Should it be reduced to 1000?

@AaronSchulz

AaronSchulz Jun 26, 2014

Preferably. Of course it can always be set at run-time, but 1000 seems like a safer default (and doesn't lose much speed vs 10000).

@AaronSchulz AaronSchulz commented on the diff Jun 26, 2014

sql/create_propertypairs.sql
@@ -0,0 +1,13 @@
+
@AaronSchulz

AaronSchulz Jun 26, 2014

It would be nice if the fields had one-line description comments.

@AaronSchulz AaronSchulz commented on the diff Jun 26, 2014

sql/create_propertypairs.sql
@@ -0,0 +1,13 @@
+
+CREATE TABLE IF NOT EXISTS /*_*/wbs_propertypairs (
+ row_id BIGINT unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
+ pid1 INT unsigned NOT NULL,
+ qid1 INT unsigned NULL,
+ pid2 INT unsigned NOT NULL,
+ count INT unsigned NOT NULL,
+ probability FLOAT NOT NULL,
+ context VARBINARY(32) NOT NULL
+) /*$wgDBTableOptions*/;
+
+
+CREATE INDEX /*i*/propertypairs_pid1_pid2_qid1_context ON /*_*/wbs_propertypairs (pid1, qid1, pid2, context);
@AaronSchulz

AaronSchulz Jun 26, 2014

What order of magnitude will this table be?

@filbertkm

filbertkm Jun 26, 2014

Contributor

will be ~84,000 rows initially

@AaronSchulz AaronSchulz commented on the diff Jun 26, 2014

src/PropertySuggester/Suggesters/SimpleSuggester.php
+ 'wbs_propertypairs',
+ array( 'pid' => 'pid2', 'prob' => "sum(probability)/$count" ),
+ array( 'pid1 IN (' . $dbr->makeList( $propertyIds ) . ')',
+ 'qid1' => null,
+ 'pid2 NOT IN (' . $dbr->makeList( $excludedIds ) . ')',
+ 'context' => $context ),
+ __METHOD__,
+ array(
+ 'GROUP BY' => 'pid2',
+ 'ORDER BY' => 'prob DESC',
+ 'LIMIT' => $limit,
+ 'HAVING' => 'prob > ' . floatval( $minProbability )
+ )
+ );
+ $this->lb->reuseConnection( $dbr );
+
@AaronSchulz

AaronSchulz Jun 26, 2014

Is there a sense of how many rows there are per pid1 value, per (pid1, qid1) tuple, and per (pid1, qid1, pid2) tuple? How often is qid1 going to be null? Also, what variations might context have? The current secondary index looks plausible for the WHERE, though it's hard to say for sure without knowing the selectivity.

What range of values might $limit have? This query will need a quicksort, though that's OK as long as the number of rows matching the WHERE is moderate. Likewise the HAVING can't use any index, though that's OK if the rows matching the WHERE are moderate. Is this likely to be the case?

@xchrdw

xchrdw Jun 26, 2014

Member

There are currently about 80,000 rows generated from the last Wikidata dump. The default limit will be 7, as this will mostly be queried by the entityselector widget. The limit is increased to 500 in case the suggestions are filtered by a string afterwards. At the moment qid1 is null in all cases, as the code that uses values for suggestions is not ready (and will probably not be finished during the project). We made some measurements of the performance with and without the index on a local MySQL db.
Here is the result with suggestions for differently sized input property sets:
https://raw.githubusercontent.com/wiki/Wikidata-lib/PropertySuggester/img/performance.png
(labels: query time / number of properties)
With the current state of Wikidata, most queries will be for relatively small items:
https://raw.githubusercontent.com/wiki/Wikidata-lib/PropertySuggester/img/property-count.png
(labels: number of properties per item / number of items)

@xchrdw xchrdw closed this Jul 1, 2014

@mariushoch mariushoch deleted the review branch Mar 24, 2015
