Merge pull request #379 from apache/Add_HeapifyWrapCompactSketch2
Add heapify wrap compact sketch2
leerho committed Jan 6, 2022
2 parents aa427f9 + 6e5f959 commit a80eb6251e78d7380a738b187a6e5d136c8cd2be
Showing 18 changed files with 1,337 additions and 420 deletions.
@@ -39,7 +39,7 @@ If you are interested in making contributions to this site please see our [Commu

---

-## Build Instructions
+## Maven Build Instructions
__NOTE:__ This component accesses resource files for testing. As a result, every directory element of the full absolute path of the target installation directory must qualify as a Java identifier; that is, no path element may contain space characters or any other non-Java-identifier characters. Oracle's Java documentation requires this in order to ensure location-independent access to resources: [See Oracle Location-Independent Access to Resources](https://docs.oracle.com/javase/8/docs/technotes/guides/lang/resources.html)
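
As a rough illustration of the note above, the following sketch checks whether every element of a directory path is made up only of Java identifier characters. It is a literal reading of the note, not part of the project: the class and method names are hypothetical, the check may be stricter than what the build actually needs, and it does not reject reserved words.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of this project: a literal reading of the note.
final class PathIdentifierCheck {

  /** Returns true if the name is non-empty and consists only of Java identifier characters. */
  static boolean isJavaIdentifier(final String name) {
    if (name.isEmpty() || !Character.isJavaIdentifierStart(name.charAt(0))) { return false; }
    for (int i = 1; i < name.length(); i++) {
      if (!Character.isJavaIdentifierPart(name.charAt(i))) { return false; }
    }
    return true;
  }

  /** Returns true if every name element of the path (root excluded) passes the check. */
  static boolean allElementsAreIdentifiers(final Path dir) {
    for (final Path element : dir) { // a Path iterates over its name elements
      if (!isJavaIdentifier(element.toString())) { return false; }
    }
    return true;
  }

  public static void main(final String[] args) {
    // Checks the directory given as the first argument, or the current directory.
    final Path dir = Paths.get(args.length > 0 ? args[0] : ".").toAbsolutePath().normalize();
    System.out.println(dir + " -> " + allElementsAreIdentifiers(dir));
  }
}
```

Running it against the intended installation directory before building gives a quick hint as to whether a path element such as `my work` could cause trouble.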

### A JDK8 with Hotspot through JDK13 with Hotspot is required to compile
@@ -88,3 +88,55 @@ There is one run-time dependency:
#### Testing
See the pom.xml file for test dependencies.

## Special Build / Test Instructions for Eclipse

Building and running tests using JDK 8 should not be a problem.

However, with JDK 9+ and Eclipse versions up to and including 4.22.0 (2021-12), Eclipse fails to translate the required JPMS JVM arguments specified in the POM into the *.classpath* file, which causes illegal reflective access errors.

There are two ways to fix this:

#### Manually update *.classpath* file:
Open the *.classpath* file in a text editor, insert the following *classpathentry* element (this assumes JDK 11; change to suit), then *refresh* the project:

```
<classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER/org.eclipse.jdt.internal.debug.ui.launcher.StandardVMType/JavaSE-11">
<attributes>
<attribute name="module" value="true"/>
<attribute name="add-exports" value="java.base/jdk.internal.misc=ALL-UNNAMED:java.base/jdk.internal.ref=ALL-UNNAMED"/>
<attribute name="add-opens" value="java.base/java.nio=ALL-UNNAMED:java.base/sun.nio.ch=ALL-UNNAMED"/>
<attribute name="maven.pomderived" value="true"/>
</attributes>
</classpathentry>
```

#### Manually update *Module Dependencies*

In Eclipse, open the project *Properties / Java Build Path / Module Dependencies ...*

* Select *java.base*
* Select *Configured details*
* Select *Expose Package...*
* Enter *Package* = java.nio
* Enter *Target module* = ALL-UNNAMED
* Select button: *opens*
* Hit *OK*
* Select *Expose Package...*
* Enter *Package* = jdk.internal.misc
* Enter *Target module* = ALL-UNNAMED
* Select button: *exports*
* Hit *OK*
* Select *Expose Package...*
* Enter *Package* = jdk.internal.ref
* Enter *Target module* = ALL-UNNAMED
* Select button: *exports*
* Hit *OK*
* Select *Expose Package...*
* Enter *Package* = sun.nio.ch
* Enter *Target module* = ALL-UNNAMED
* Select button: *opens*
* Hit *OK*


**NOTE:** If you execute *Maven/Update Project...* from Eclipse with the option *Update project configuration from pom.xml* checked, all of the above will be erased, and you will have to redo it.
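
To confirm that either fix took effect, one option is to ask the JVM directly whether `java.base` exports or opens the four packages to the calling code. The sketch below is not part of this project; it assumes JDK 9+ and that the class runs from the classpath (the unnamed module), and the class and method names are hypothetical.

```java
// Hypothetical diagnostic, not part of this project. Requires JDK 9+.
final class JpmsAccessCheck {

  /** True if java.base exports the package to the module that contains this class. */
  static boolean isExportedToCaller(final String pkg) {
    final Module javaBase = Object.class.getModule();
    return javaBase.isExported(pkg, JpmsAccessCheck.class.getModule());
  }

  /** True if java.base opens the package for deep reflection to this class's module. */
  static boolean isOpenToCaller(final String pkg) {
    final Module javaBase = Object.class.getModule();
    return javaBase.isOpen(pkg, JpmsAccessCheck.class.getModule());
  }

  public static void main(final String[] args) {
    // The packages listed under add-exports in the classpathentry above.
    for (final String pkg : new String[] {"jdk.internal.misc", "jdk.internal.ref"}) {
      System.out.println("exports " + pkg + ": " + isExportedToCaller(pkg));
    }
    // The packages listed under add-opens in the classpathentry above.
    for (final String pkg : new String[] {"java.nio", "sun.nio.ch"}) {
      System.out.println("opens   " + pkg + ": " + isOpenToCaller(pkg));
    }
  }
}
```

The printed values depend on the JVM version and on the arguments the launcher actually passed, so the run is only informative when it uses the same Eclipse launch configuration as the tests.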

@@ -695,7 +695,7 @@ under the License.
<useManifestOnlyJar>false</useManifestOnlyJar>
<redirectTestOutputToFile>true</redirectTestOutputToFile>
<reportsDirectory>${project.build.directory}/test-output/${maven.build.timestamp}</reportsDirectory>
-<argLine>
+<argLine>@{argLine}
--add-exports java.base/jdk.internal.misc=ALL-UNNAMED
--add-exports java.base/jdk.internal.ref=ALL-UNNAMED
--add-opens java.base/java.nio=ALL-UNNAMED
@@ -19,25 +19,249 @@

package org.apache.datasketches.theta;

import static org.apache.datasketches.Family.idToFamily;
import static org.apache.datasketches.Util.DEFAULT_UPDATE_SEED;
import static org.apache.datasketches.theta.PreambleUtil.COMPACT_FLAG_MASK;
import static org.apache.datasketches.theta.PreambleUtil.EMPTY_FLAG_MASK;
import static org.apache.datasketches.theta.PreambleUtil.FAMILY_BYTE;
import static org.apache.datasketches.theta.PreambleUtil.FLAGS_BYTE;
import static org.apache.datasketches.theta.PreambleUtil.ORDERED_FLAG_MASK;
import static org.apache.datasketches.theta.PreambleUtil.READ_ONLY_FLAG_MASK;
import static org.apache.datasketches.theta.PreambleUtil.SER_VER_BYTE;
import static org.apache.datasketches.theta.PreambleUtil.extractSeedHash;
import static org.apache.datasketches.theta.SingleItemSketch.otherCheckForSingleItem;

import org.apache.datasketches.Family;
import org.apache.datasketches.SketchesArgumentException;
import org.apache.datasketches.Util;
import org.apache.datasketches.memory.Memory;
import org.apache.datasketches.memory.WritableMemory;

/**
* The parent class of all the CompactSketches. CompactSketches are never created directly.
-* They are created as a result of the compact() method of an UpdateSketch or as a result of a
-* getResult() of a SetOperation.
+* They are created as a result of the compact() method of an UpdateSketch, a result of a
+* getResult() of a SetOperation, or from a heapify method.
*
* <p>A CompactSketch is the simplest form of a Theta Sketch. It consists of a compact list
* (i.e., no intervening spaces) of hash values, which may be ordered or not, a value for theta
-* and a seed hash. A CompactSketch is read-only,
+* and a seed hash. A CompactSketch is immutable (read-only),
* and the space required when stored is only the space required for the hash values and 8 to 24
* bytes of preamble. An empty CompactSketch consumes only 8 bytes.</p>
*
* @author Lee Rhodes
*/
public abstract class CompactSketch extends Sketch {
private static final short defaultSeedHash = Util.computeSeedHash(DEFAULT_UPDATE_SEED);

/**
* Heapify takes a CompactSketch image in Memory and instantiates an on-heap CompactSketch.
*
* <p>The resulting sketch will not retain any link to the source Memory and all of its data will be
* copied to the heap CompactSketch.</p>
*
* <p>This method assumes that the sketch image was created with the correct hash seed, so it is not checked.
* The resulting on-heap CompactSketch will be given the seedHash derived from the given sketch image.
* However, Serial Version 1 sketch images do not have a seedHash field,
* so the resulting heapified CompactSketch will be given the hash of the DEFAULT_UPDATE_SEED.</p>
*
* @param srcMem an image of a CompactSketch.
* <a href="{@docRoot}/resources/dictionary.html#mem">See Memory</a>.
* @return a CompactSketch on the heap.
*/
public static CompactSketch heapify(final Memory srcMem) {
final int serVer = srcMem.getByte(SER_VER_BYTE) & 0XFF;
final int familyID = srcMem.getByte(FAMILY_BYTE) & 0XFF;
final Family family = Family.idToFamily(familyID);
if (family != Family.COMPACT) {
throw new IllegalArgumentException("Corrupted: " + family + " is not Compact!");
}
if (serVer == 3) { //no seed check
final int flags = PreambleUtil.extractFlags(srcMem);
final boolean srcOrdered = (flags & ORDERED_FLAG_MASK) != 0;
return CompactOperations.memoryToCompact(srcMem, srcOrdered, null);
}
//not SerVer 3, assume compact stored form
if (serVer == 1) {
return ForwardCompatibility.heapify1to3(srcMem, defaultSeedHash);
}
if (serVer == 2) {
final short srcSeedHash = (short) extractSeedHash(srcMem);
return ForwardCompatibility.heapify2to3(srcMem, srcSeedHash);
}
throw new SketchesArgumentException("Unknown Serialization Version: " + serVer);
}

/**
* Heapify takes a CompactSketch image in Memory and instantiates an on-heap CompactSketch.
*
* <p>The resulting sketch will not retain any link to the source Memory and all of its data will be
* copied to the heap CompactSketch.</p>
*
* <p>This method checks if the given expectedSeed was used to create the source Memory image.
* However, SerialVersion 1 sketch images cannot be checked as they don't have a seedHash field,
* so the resulting heapified CompactSketch will be given the hash of the expectedSeed.</p>
*
* @param srcMem an image of a CompactSketch that was created using the given expectedSeed.
* <a href="{@docRoot}/resources/dictionary.html#mem">See Memory</a>.
* @param expectedSeed the seed used to validate the given Memory image.
* <a href="{@docRoot}/resources/dictionary.html#seed">See Update Hash Seed</a>.
* @return a CompactSketch on the heap.
*/
public static CompactSketch heapify(final Memory srcMem, final long expectedSeed) {
final int serVer = srcMem.getByte(SER_VER_BYTE);
final byte familyID = srcMem.getByte(FAMILY_BYTE);

final Family family = idToFamily(familyID);
if (family != Family.COMPACT) {
throw new IllegalArgumentException("Corrupted: " + family + " is not Compact!");
}
if (serVer == 3) {
final int flags = PreambleUtil.extractFlags(srcMem);
final boolean srcOrdered = (flags & ORDERED_FLAG_MASK) != 0;
final boolean empty = (flags & EMPTY_FLAG_MASK) != 0;
if (!empty) { PreambleUtil.checkMemorySeedHash(srcMem, expectedSeed); }
return CompactOperations.memoryToCompact(srcMem, srcOrdered, null);
}
//not SerVer 3, assume compact stored form
final short seedHash = Util.computeSeedHash(expectedSeed);
if (serVer == 1) {
return ForwardCompatibility.heapify1to3(srcMem, seedHash);
}
if (serVer == 2) {
return ForwardCompatibility.heapify2to3(srcMem, seedHash);
}
throw new SketchesArgumentException("Unknown Serialization Version: " + serVer);
}

/**
* Wrap takes the CompactSketch image in given Memory and refers to it directly.
* There is no data copying onto the java heap.
* The wrap operation enables fast read-only merging and access to all the public read-only API.
*
* <p>Only "Direct" Serialization Version 3 (i.e., OpenSource) sketches that have
* been explicitly stored as direct sketches can be wrapped.
* Wrapping earlier serial version sketches will result in a heapify operation.
* These early versions were never designed to "wrap".</p>
*
* <p>Wrapping any subclass of this class that is empty or contains only a single item will
* result in heapified forms of empty and single item sketch respectively.
* This is actually faster and consumes less overall memory.</p>
*
* <p>This method assumes that the sketch image was created with the correct hash seed, so it is not checked.
* However, Serial Version 1 sketch images do not have a seedHash field,
* so the resulting on-heap CompactSketch will be given the hash of the DEFAULT_UPDATE_SEED.</p>
*
* @param srcMem an image of a Sketch.
* <a href="{@docRoot}/resources/dictionary.html#mem">See Memory</a>.
* @return a CompactSketch backed by the given Memory except as above.
*/
public static CompactSketch wrap(final Memory srcMem) {
final int serVer = srcMem.getByte(SER_VER_BYTE) & 0XFF;
final int familyID = srcMem.getByte(FAMILY_BYTE) & 0XFF;
final Family family = Family.idToFamily(familyID);
if (family != Family.COMPACT) {
throw new IllegalArgumentException("Corrupted: " + family + " is not Compact!");
}
if (serVer == 3) {
if (PreambleUtil.isEmptyFlag(srcMem)) {
return EmptyCompactSketch.getHeapInstance(srcMem);
}
final short memSeedHash = (short) extractSeedHash(srcMem);
if (otherCheckForSingleItem(srcMem)) { //SINGLEITEM?
return SingleItemSketch.heapify(srcMem, memSeedHash);
}
//not empty & not singleItem
final int flags = srcMem.getByte(FLAGS_BYTE);
final boolean compactFlag = (flags & COMPACT_FLAG_MASK) > 0;
if (!compactFlag) {
throw new SketchesArgumentException(
"Corrupted: COMPACT family sketch image must have compact flag set");
}
final boolean readOnly = (flags & READ_ONLY_FLAG_MASK) > 0;
if (!readOnly) {
throw new SketchesArgumentException(
"Corrupted: COMPACT family sketch image must have Read-Only flag set");
}
return DirectCompactSketch.wrapInstance(srcMem, memSeedHash);
} //end of serVer 3
else if (serVer == 1) {
return ForwardCompatibility.heapify1to3(srcMem, defaultSeedHash);
}
else if (serVer == 2) {
final short memSeedHash = (short) extractSeedHash(srcMem);
return ForwardCompatibility.heapify2to3(srcMem, memSeedHash);
}
throw new SketchesArgumentException(
"Corrupted: Serialization Version " + serVer + " not recognized.");
}

/**
* Wrap takes the sketch image in the given Memory and refers to it directly.
* There is no data copying onto the java heap.
* The wrap operation enables fast read-only merging and access to all the public read-only API.
*
* <p>Only "Direct" Serialization Version 3 (i.e., OpenSource) sketches that have
* been explicitly stored as direct sketches can be wrapped.
* Wrapping earlier serial version sketches will result in a heapify operation.
* These early versions were never designed to "wrap".</p>
*
* <p>Wrapping any subclass of this class that is empty or contains only a single item will
* result in heapified forms of empty and single item sketch respectively.
* This is actually faster and consumes less overall memory.</p>
*
* <p>This method checks if the given expectedSeed was used to create the source Memory image.
* However, SerialVersion 1 sketches cannot be checked as they don't have a seedHash field,
* so the resulting heapified CompactSketch will be given the hash of the expectedSeed.</p>
*
* @param srcMem an image of a Sketch that was created using the given expectedSeed.
* <a href="{@docRoot}/resources/dictionary.html#mem">See Memory</a>
* @param expectedSeed the seed used to validate the given Memory image.
* <a href="{@docRoot}/resources/dictionary.html#seed">See Update Hash Seed</a>.
* @return a CompactSketch backed by the given Memory except as above.
*/
public static CompactSketch wrap(final Memory srcMem, final long expectedSeed) {
final int serVer = srcMem.getByte(SER_VER_BYTE) & 0XFF;
final int familyID = srcMem.getByte(FAMILY_BYTE) & 0XFF;
final Family family = Family.idToFamily(familyID);
if (family != Family.COMPACT) {
throw new IllegalArgumentException("Corrupted: " + family + " is not Compact!");
}
final short seedHash = Util.computeSeedHash(expectedSeed);

if (serVer == 3) {
if (PreambleUtil.isEmptyFlag(srcMem)) {
return EmptyCompactSketch.getHeapInstance(srcMem);
}
if (otherCheckForSingleItem(srcMem)) { //SINGLEITEM?
return SingleItemSketch.heapify(srcMem, seedHash);
}
//not empty & not singleItem
final int flags = srcMem.getByte(FLAGS_BYTE);
final boolean compactFlag = (flags & COMPACT_FLAG_MASK) > 0;
if (!compactFlag) {
throw new SketchesArgumentException(
"Corrupted: COMPACT family sketch image must have compact flag set");
}
final boolean readOnly = (flags & READ_ONLY_FLAG_MASK) > 0;
if (!readOnly) {
throw new SketchesArgumentException(
"Corrupted: COMPACT family sketch image must have Read-Only flag set");
}
return DirectCompactSketch.wrapInstance(srcMem, seedHash);
} //end of serVer 3
else if (serVer == 1) {
return ForwardCompatibility.heapify1to3(srcMem, seedHash);
}
else if (serVer == 2) {
return ForwardCompatibility.heapify2to3(srcMem, seedHash);
}
throw new SketchesArgumentException(
"Corrupted: Serialization Version " + serVer + " not recognized.");

}


-//Sketch
+//Sketch Overrides

@Override
public abstract CompactSketch compact(final boolean dstOrdered, final WritableMemory dstMem);
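
Review note on the byte reads above: `heapify(Memory)` and both `wrap` methods read the serialization version and family ID as `getByte(...) & 0XFF`, while `heapify(Memory, long)` assigns `getByte(SER_VER_BYTE)` to an `int` without the mask. For the valid versions 1 through 3 both forms give the same value. They differ only when the byte is above 127, for example in a corrupted image, where the unmasked read is negative and would appear that way in the exception message. The standalone sketch below shows the difference using a plain `byte[]` in place of `Memory`; the class name and the byte values are hypothetical.

```java
// Hypothetical illustration using a plain byte[] instead of Memory.
final class UnsignedByteRead {

  /** Reads a byte the way an unmasked getByte(...) assigned to an int would: sign-extended. */
  static int signedRead(final byte[] image, final int offset) {
    return image[offset];
  }

  /** Reads a byte as an unsigned value in the range 0 to 255. */
  static int unsignedRead(final byte[] image, final int offset) {
    return image[offset] & 0xFF;
  }

  public static void main(final String[] args) {
    final byte[] image = {0, (byte) 0x83, 3}; // made-up bytes, not a real sketch preamble
    System.out.println(signedRead(image, 1));   // prints -125
    System.out.println(unsignedRead(image, 1)); // prints 131
    // For small values such as a valid version number, the two reads agree.
    System.out.println(signedRead(image, 2) == unsignedRead(image, 2)); // prints true
  }
}
```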
@@ -19,10 +19,10 @@

package org.apache.datasketches.theta;

import static org.apache.datasketches.Util.checkSeedHashes;
import static org.apache.datasketches.theta.CompactOperations.checkIllegalCurCountAndEmpty;
import static org.apache.datasketches.theta.CompactOperations.memoryToCompact;
import static org.apache.datasketches.theta.PreambleUtil.ORDERED_FLAG_MASK;
import static org.apache.datasketches.theta.PreambleUtil.checkMemorySeedHash;
import static org.apache.datasketches.theta.PreambleUtil.extractCurCount;
import static org.apache.datasketches.theta.PreambleUtil.extractFlags;
import static org.apache.datasketches.theta.PreambleUtil.extractPreLongs;
@@ -60,16 +60,16 @@ class DirectCompactSketch extends CompactSketch {
* Wraps the given Memory, which must be a SerVer 3, ordered, CompactSketch image.
* Must check the validity of the Memory before calling. The order bit must be set properly.
* @param srcMem <a href="{@docRoot}/resources/dictionary.html#mem">See Memory</a>
-* @param seed The update seed.
-* <a href="{@docRoot}/resources/dictionary.html#seed">See Update Hash Seed</a>.
+* @param seedHash The update seedHash.
+* <a href="{@docRoot}/resources/dictionary.html#seedHash">See Seed Hash</a>.
* @return this sketch
*/
-static DirectCompactSketch wrapInstance(final Memory srcMem, final long seed) {
-checkMemorySeedHash(srcMem, seed);
+static DirectCompactSketch wrapInstance(final Memory srcMem, final short seedHash) {
+checkSeedHashes((short) extractSeedHash(srcMem), seedHash);
return new DirectCompactSketch(srcMem);
}

-//Sketch
+//Sketch Overrides

@Override
public CompactSketch compact(final boolean dstOrdered, final WritableMemory dstMem) {
