IGNITE-13713 [ML]: Add target encoding preprocessor #8466

Merged
merged 1 commit into from
Dec 9, 2020

mrk-andreev
Contributor

  • Add target encoding preprocessor

Issue: https://issues.apache.org/jira/browse/IGNITE-13713

Member

@zaleslaw left a comment

This PR makes me happy, but it raises a few questions:

  1. The use case with GBT looks strange to me.
  2. A dataset of 32,000 rows also looks too large for the resources. Could we use another dataset instead of the proposed one? Titanic or something else with 100-1000 rows.

Let's discuss it here, in this PR.

strEncoderPreprocessor
);

Preprocessor<Integer, Object[]> lbEncoderPreprocessor = new EncoderTrainer<Integer, Object[]>()
Member

Sorry, but I didn't understand this pipeline. Why are these 3 encoders combined here? Can they work only in this combination?
In my opinion, the user has a choice of what to do with Strings, but they should choose one method (not a chain of methods).

Please share your vision here.

Contributor Author

I think that we want to use EncoderType.TARGET_ENCODER for only a few columns (maybe only one). In this example I use EncoderType.STRING_ENCODER as the general-purpose encoder and EncoderType.TARGET_ENCODER for the special one.

});

double[][] postProcessedData = new double[][] {
{1.0, 0.1, 1.0},
Member

Could you please explain the numbers in the last columns: why are they 1.0 and 2.0, and not 0.33 and 0.66?

Contributor Author

Described each case

* encodedValue = globalTargetMean * (1 - alpha) + categoryTargetMean * alpha
* if categorySize == 1 then use globalTargetMean
*
* min_samples_leaf - minimum samples to take category average into account.
Member

looks like min_samples_leaf is not used in this class

Contributor Author

Yes, but it is used to evaluate TargetEncodingMeta, so I wanted to mention it in the encoder class.
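
For readers following this thread, the formula quoted in the Javadoc above can be sketched as a standalone method. This is only an illustration, not the code from the PR: the excerpt does not define `alpha`, so the sketch assumes a sigmoid of the category size shifted by `minSamplesLeaf`, and the `smoothing` parameter and the class name `TargetEncodingSketch` are hypothetical.

```java
public class TargetEncodingSketch {
    /**
     * Weight given to the category mean.
     * ASSUMPTION: a sigmoid of the category size, shifted by minSamplesLeaf and
     * scaled by smoothing. The quoted comment does not define alpha.
     */
    public static double alpha(long categorySize, double minSamplesLeaf, double smoothing) {
        return 1.0 / (1.0 + Math.exp(-(categorySize - minSamplesLeaf) / smoothing));
    }

    /** Blends the global and per-category target means as in the quoted comment. */
    public static double encode(double globalTargetMean, double categoryTargetMean,
        long categorySize, double minSamplesLeaf, double smoothing) {
        // A category seen only once falls back to the global mean.
        if (categorySize == 1)
            return globalTargetMean;

        double a = alpha(categorySize, minSamplesLeaf, smoothing);

        return globalTargetMean * (1 - a) + categoryTargetMean * a;
    }

    public static void main(String[] args) {
        double globalMean = 0.5; // Mean target over all rows (toy value).

        // Category with 3 rows and mean target 2/3, minSamplesLeaf = 1, smoothing = 1.
        System.out.println(encode(globalMean, 2.0 / 3.0, 3, 1.0, 1.0));

        // Singleton category: the result is the global mean regardless of its own mean.
        System.out.println(encode(globalMean, 1.0, 1, 1.0, 1.0));
    }
}
```

Under this assumption, a category whose size equals `minSamplesLeaf` gets `alpha = 0.5`, so its encoded value sits halfway between the global mean and the category mean.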

int finalI = i;

targetEncodingMetas[i] = new TargetEncodingMeta(
targetCounters[i].getTargetSum() / targetCounters[i].getTargetCount(),
Member

I suggest refactoring the constructor parameters into separate variables for readability.

Contributor Author

Extracted to new method

.collect(Collectors.toMap(
Map.Entry::getKey,
value -> {
double prior = targetCounters[finalI].getTargetSum() /
Member

Also this lambda should be encapsulated and commented separately

Contributor Author

Prior evaluation extracted, but the lambda still exists.
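
One way to act on the reviewer's suggestion is to move the lambda body into a named, documented method and pass it to `Collectors.toMap` as a method reference. The sketch below is a simplified illustration under stated assumptions: `CategoryMeanSketch`, `Counter` and `categoryMean` are hypothetical names, and only the per-category mean is computed, not the full encoding from the PR.

```java
import java.util.Map;
import java.util.stream.Collectors;

public class CategoryMeanSketch {
    /** Hypothetical counter: sum of targets and number of rows for one category. */
    public static final class Counter {
        final double targetSum;
        final long targetCount;

        public Counter(double targetSum, long targetCount) {
            this.targetSum = targetSum;
            this.targetCount = targetCount;
        }
    }

    /** Named replacement for the inline lambda: mean target of a single category. */
    public static double categoryMean(Map.Entry<String, Counter> entry) {
        Counter cntr = entry.getValue();

        return cntr.targetSum / cntr.targetCount;
    }

    /** Maps every category to its mean target value. */
    public static Map<String, Double> categoryMeans(Map<String, Counter> counters) {
        return counters.entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey, CategoryMeanSketch::categoryMean));
    }

    public static void main(String[] args) {
        Map<String, Counter> counters = Map.of("a", new Counter(2.0, 4), "b", new Counter(3.0, 3));

        System.out.println(categoryMeans(counters));
    }
}
```

The named method can carry its own Javadoc and be unit-tested separately, which is harder to do with an inline lambda.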

);
}
}

Member

remove the blank line

Contributor Author

Done

}
else if (featureVal instanceof String)
strVal = (String)featureVal;
else if (featureVal instanceof Double)
Member

Maybe add type conversion to Double from other Number types (and Boolean)

Contributor Author

Added Number & Boolean
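
The kind of type handling discussed in this thread could look roughly like the sketch below. It is not the merged code: the class and method names are hypothetical, and the choice to route every `Number` through `doubleValue()` (so that `1` and `1.0` share one category key) is an assumption made for illustration.

```java
public class FeatureKeySketch {
    /**
     * Converts a raw feature value to the string key used to look up its category.
     * ASSUMPTION: all Number types are normalized through doubleValue().
     */
    public static String toCategoryKey(Object featureVal) {
        if (featureVal instanceof String)
            return (String)featureVal;

        if (featureVal instanceof Number)
            return String.valueOf(((Number)featureVal).doubleValue());

        if (featureVal instanceof Boolean)
            return String.valueOf(featureVal);

        // Anything else (including null) is rejected explicitly.
        throw new IllegalArgumentException("Unsupported feature type: " +
            (featureVal == null ? "null" : featureVal.getClass().getName()));
    }

    public static void main(String[] args) {
        System.out.println(toCategoryKey("red"));
        System.out.println(toCategoryKey(1));
        System.out.println(toCategoryKey(2L));
        System.out.println(toCategoryKey(Boolean.TRUE));
    }
}
```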

@zaleslaw merged commit b6ecc82 into apache:master Dec 9, 2020
anton-vinogradov pushed a commit that referenced this pull request Dec 28, 2020
* IGNITE-13672 [ML]: Add initial JSON export/import support for all models (#8521)

* [IGNITE-13672] Initial solution

* [IGNITE-13672] Added an example

* [IGNITE-13672] Added a draft solution

* [IGNITE-13672] Updated JSON model

* [IGNITE-13672] Updated JSON model

* [IGNITE-13672] Removed GMM support

* [IGNITE-13672] Fixed blank lines

* [IGNITE-13672] Fixed licenses

* [IGNITE-13672] Fixed whitespaces

* [IGNITE-13672] Fixed whitespaces

* [IGNITE-13672] Fixed whitespaces

* [IGNITE-13672] Fixed examples

* [IGNITE-13672] Fixed examples

* [IGNITE-13672] Fixed test

* IGNITE-13388 Fix apache-ignite deb package dependency on JVM package - Fixes #8191.

Signed-off-by: Ilya Kasnacheev <ilya.kasnacheev@gmail.com>

* IGNITE-13770 Fix NPE in Ignite.dataRegionMetrics with empty persistent region - Fixes #8506.

Signed-off-by: Ilya Kasnacheev <ilya.kasnacheev@gmail.com>

* IGNITE-13640 Added runtime dependencies to opencensus module. Fixes #8406

Signed-off-by: Slava Koptilin <slava.koptilin@gmail.com>

* IGNITE-13520 Skip generating encryption keys on the client node. (#8317)

* IGNITE-13496 Java thin: make async API non-blocking with GridNioServer

Refactor Java Thin Client to use GridNioServer in client mode:
* Client threads are never blocked
* Single worker thread is shared across all connections within `IgniteClient`

Benchmark results (i7-9700K, Ubuntu 20.04.1, JDK 1.8.0_275):

Before
Benchmark                         Mode  Cnt      Score      Error  Units
JmhThinClientCacheBenchmark.get  thrpt   10  65916.805 ± 2118.954  ops/s
JmhThinClientCacheBenchmark.put  thrpt   10  62304.444 ± 2521.371  ops/s

After
Benchmark                         Mode  Cnt      Score      Error  Units
JmhThinClientCacheBenchmark.get  thrpt   10  92501.557 ± 1380.384  ops/s
JmhThinClientCacheBenchmark.put  thrpt   10  82907.446 ± 7572.537  ops/s

* IGNITE-13793: Implement SQLRowCount for SELECT

This closes #8525

* [IGNITE-13803] Fixed Scalar test failed due to incorrect Jackson dependency (#8529)

* [IGNITE-13803] Changed dependency

* [IGNITE-13803] Exclude dependency

* IGNITE-13190 Native Persistence Defragmentation core functionality - Fixes #7984.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13742 INACTIVE mode is forced on nodes in Maintenance Mode - Fixes #8524.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13807 [MINOR] Fix error message in tests. (#8530)

* IGNITE-13795 Added escaping of node consistent id in diagnostic pagelock dump file name. - Fixes #8526.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13776 BPlus tree lock retries limit reached with sqlOnHeapCacheEnabled (#8514)

* IGNITE-13802 Added missing "setCandidatePageCount" in "GridCacheOffheapManager.addPartitions" - Fixes #8527.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-10655 .NET: Add IgniteConfiguration.JavaPeerClassLoadingEnabled

* IGNITE-13633 Fixed ServiceDescriptor#serviceClass failure in case of service deployed through UriDeploymentSpi (#8431)

* IGNITE-13808 Failure handling disabled for index validation. - Fixes #8535.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13697 Schedule and cancel control utility commands for defragmentation feature - Fixes #8449.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13811 Fixed bug with removing wrong key from pingMap in ServerImpl. - Fixes #8539.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13813 Fixed assertion in page snapshot apply method. - Fixes #8541.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13812 Fixed possible ClassCastException on checkpoint start with disabled WAL. - Fixes #8540.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-8884 .NET: Fix async key-val operations - use WriteObjectDetached

Fix async cache operations when key and value objects reference each other or have references to the same object.
Async key-val operations used `WriteObject` instead of `WriteObjectDetached`, so references to the same inner object were shared in the binary stream (the referenced object is written once). However, the cache stores key and val binary objects separately, so the reference to the inner object gets broken.

`WriteObjectDetached` disables reference sharing and writes both objects independently.

* IGNITE-13320 Cache encryption key rotation CLI management - Fixes #8242.

Signed-off-by: Aleksey Plekhanov <plehanov.alex@gmail.com>

* IGNITE-13825: Fix precision and scale for columns in SQL result set

This closes #8551

* IGNITE-10075 .NET Avoid binary configurations of Ignite Java service params (#8509)

* IGNITE-13827 Java thin client: Fixed hang on ComputeTask returning unregistered type - Fixes #8552.

Signed-off-by: Aleksey Plekhanov <plehanov.alex@gmail.com>

* IGNITE-13709 Control.sh API - status command for defragmentation feature - Fixes #8548.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13775 checkpointRWLock wrapper refactoring - Fixes #8516.

Signed-off-by: Ilya Kasnacheev <ilya.kasnacheev@gmail.com>

* IGNITE-13814 restorePartitionStates moved to sys pool instead of striped pool. - Fixes #8542.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13801: Fix Ab Initio related ODBC issues

This closes #8528

* IGNITE-13713 Add target encoding preprocessor (#8466)

* IGNITE-13714 Add catboost inference integration (#8489)

* IGNITE-13353 Got rid of unnecessary rebalance on starting new cache.

Signed-off-by: Slava Koptilin <slava.koptilin@gmail.com>

* IGNITE-13823 WAL iterator WRITE permission requirement removed. - Fixes #8549.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13450 [MINOR] Added missed javadoc for EVT_CACHE_QUERY_EXECUTED event.

* IGNITE-13786 Add defragmentation-specific B+Tree optimizations - Fixes #8560.

Signed-off-by: Alexey Goncharuk <alexey.goncharuk@gmail.com>

* IGNITE-13826 .NET: Add RendezvousAffinityFunction.BackupFilter

Add RendezvousAffinityFunction.BackupFilter with a single predefined implementation that delegates to Java: ClusterNodeAttributeAffinityBackupFilter.

* IGNITE-13833 More versions added to PersistenceBasicCompatibilityTest - Fixes #8562.

Signed-off-by: Ilya Kasnacheev <ilya.kasnacheev@gmail.com>

* IGNITE-13832 Proper handling of interrupted exceptions in disco-notifier-worker. - Fixes #8561.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13101 Metastore should complete all write futures during stop and prohibit creating new ones - Fixes #8554.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13815 Remove ability to delete segments from the middle of WAL archive - Fixes #8545.

Signed-off-by: Ilya Kasnacheev <ilya.kasnacheev@gmail.com>

* IGNITE-12892 WAL archive size configuration made more clear - Fixes #8550.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13838 IgniteSqlSplitterSelfTest fixes various tests - Fixes #8565.

Signed-off-by: Ilya Kasnacheev <ilya.kasnacheev@gmail.com>

* ignite docs: fixing a broken documentation link

* ignite docs: updated the index page with quick links to the APIs and examples

* ignite docs: fixed broken links and updated the C++ API header

* IGNITE-12666 Provide cluster performance profiling tool (#7693)

* ignite docs: fixed case of GitHub

* IGNITE-13743 JMX API for Defragmentation monitoring and management - Fixes #8496.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13848 Fixed incorrect updating of SegmentReservationStorage#minReserveIdx when truncating WAL segments. Fixes #8573

Signed-off-by: Slava Koptilin <slava.koptilin@gmail.com>

* IGNITE-13847 GridEncryptionManager#onWalSegmentRemoved should be invoked async - Fixes #8576.

Signed-off-by: Ilya Kasnacheev <ilya.kasnacheev@gmail.com>

* IGNITE-13876 Updated documentation for 2.9.1 release (#8592)

* IGNITE-13865 Support DateTime as a key or value in .NET and Java (#8580)

* IGNITE-13880 Fix PageMemoryTracker related flaky tests - Fixes #8597.

Signed-off-by: Aleksey Plekhanov <plehanov.alex@gmail.com>

* IGNITE-13766 API for network connectivity check - Fixes #8500.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13864 Fixed an issue where acknowledge on a stale latch could lead to assertion error. Fixes #8579

Signed-off-by: Slava Koptilin <slava.koptilin@gmail.com>

* IGNITE-13869 Added additional logging for a query mapping. Fixes #8585

Signed-off-by: Slava Koptilin <slava.koptilin@gmail.com>

* IGNITE-13867 Fixed an issue related to erroneous sending TTL update requests. Fixes #8583

Signed-off-by: Slava Koptilin <slava.koptilin@gmail.com>

* IGNITE-13870 Removed obsolete GridCacheAdapter#validateCacheKey. Fixes #8586

Signed-off-by: Slava Koptilin <slava.koptilin@gmail.com>

* IGNITE-13720 Parallelism for defragmentation added. - Fixes #8574.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13831 Move WAL archive cleanup from checkpoint to rollover - Fixes #8563.

Signed-off-by: Ilya Kasnacheev <ilya.kasnacheev@gmail.com>

* IGNITE-13866 validate_indexes command is interrupted if connection to initiator is broken. Fixes #8593

Signed-off-by: Slava Koptilin <slava.koptilin@gmail.com>

* IGNITE-13868 Added additional tests related to simultaneously created caches. Fixes #8584

Signed-off-by: Slava Koptilin <slava.koptilin@gmail.com>

* IGNITE-13896 Fix javadoc build failure - Fixes #8601.

Signed-off-by: Aleksey Plekhanov <plehanov.alex@gmail.com>

* IGNITE-12824 .NET: Add BinaryConfiguration.TimestampConverter (#8568)

Co-authored-by: Pavel Tupitsyn <ptupitsyn@apache.org>

* IGNITE-13900: Fix C++ Affinity tests (#8605)

* IGNITE-13708 Add thin client support for Spring Transactions - Fixes #8556.

Signed-off-by: Aleksey Plekhanov <plehanov.alex@gmail.com>

* IGNITE-13910 Missing segment is not released - Fixes #8612.

Signed-off-by: Sergey Chugunov <sergey.chugunov@gmail.com>

* IGNITE-13908: ODBC nullability info for columns

This closes #8610

* IGNITE-13507 Fix NullPointerException on tx recovery - Fixes #8547.

Signed-off-by: Ilya Kasnacheev <ilya.kasnacheev@gmail.com>

* IGNITE-13734 .NET: Register service return type on method invocation (#8602)

* IGNITE-13856 Linear performance for DirectByteBufferStreamImplV2.writeString - Fixes #8577.

Signed-off-by: Ilya Kasnacheev <ilya.kasnacheev@gmail.com>

* IGNITE-13555 Java thin: add IPv6 address support

- Change HostAndPortRange.parse method to support addresses like [IPv6_host]:port1..port2, because the previous implementation did not recognize IPv6.
- Add tests for HostAndPortRange.parse method for both IPv4 and IPv6 hosts.

* IGNITE-13680 Improve OS suggestions for Linux - Fixes #8503.

Signed-off-by: Ilya Kasnacheev <ilya.kasnacheev@gmail.com>

* IGNITE-11406 Fix NullPointerException on client start - Fixes #8604.

Signed-off-by: Ilya Kasnacheev <ilya.kasnacheev@gmail.com>

Co-authored-by: Alexey Zinoviev <zaleslaw.sin@gmail.com>
Co-authored-by: Peter Ivanov <mr.weider@gmail.com>
Co-authored-by: Ilya Kasnacheev <ilya.kasnacheev@gmail.com>
Co-authored-by: Alexander Lapin <lapin1702@gmail.com>
Co-authored-by: Pavel Pereslegin <xxtern@gmail.com>
Co-authored-by: Pavel Tupitsyn <ptupitsyn@apache.org>
Co-authored-by: Igor Sapego <isapego@apache.org>
Co-authored-by: ibessonov <bessonov.ip@gmail.com>
Co-authored-by: korlov42 <korlov@gridgain.com>
Co-authored-by: Aleksandr Shapkin <ashapkin@gridgain.com>
Co-authored-by: Aleksey Plekhanov <Plehanov.Alex@gmail.com>
Co-authored-by: Nikolay <nizhikov@apache.org>
Co-authored-by: zstan <stanilovsky@gmail.com>
Co-authored-by: Mark Andreev <mark.andreev@gmail.com>
Co-authored-by: sergeyuttsel <uttsel@gmail.com>
Co-authored-by: Slava Koptilin <slava.koptilin@gmail.com>
Co-authored-by: Kirill Tkalenko <tkalkirill@yandex.ru>
Co-authored-by: Semyon Danilov <samvimes@yandex.ru>
Co-authored-by: Nikita Safonov <73828260+Nikita-tech-writer@users.noreply.github.com>
Co-authored-by: Denis Magda <dmagda@gridgain.com>
Co-authored-by: Nikita Amelchev <nsamelchev@gmail.com>
Co-authored-by: ymolochkov <molochkovyn@gmail.com>
Co-authored-by: vd_pyatkov <vldpyatkov@gmail.com>
Co-authored-by: Anton Kalashnikov <kaa.dev@yandex.ru>
Co-authored-by: Mikhail Petrov <pmgheap.sbt@gmail.com>
Co-authored-by: pvinokurov <vinokurov.pasha@gmail.com>
Co-authored-by: Ilya Kazakov <kazakov.ilya@gmail.com>
Co-authored-by: Varvara Kozhukhova <53300653+vkozhukhova@users.noreply.github.com>
Co-authored-by: shubin <nyshubin@gmail.com>