428 - Use global primitive data types instead of local data type caches #334

mpenick · 2017-01-20T21:03:11Z

No description provided.

stamhankar999 · 2017-01-21T02:48:55Z

This looks great! My only gripe is that when adding a new type to the system, we have to update three files: the CASS_VALUE_TYPE_MAPPING in cassandra.h and the two .gperf files. The former actually has some of the same info as the value_types_by_cql.gperf file.

Consider this alternative:

1. Add the class-names in CASS_VALUE_TYPE_MAPPING
2. Just as we have the DataType::native_types_ class variable, add a map variable for mapping class-names to DataType's (or the enum value), and add a map variable for mapping cql-type-names to DataType's.
3. Have the DataTypeInitializer initialize the maps

If it's an STL map, we'll have log(n) access, which is, of course, worse than the O(1) hash access provided by gperf. If that's a concern, maybe there's a general hash-map impl we can leverage. It may not be perfect hashing, but I think that would be good enough.
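To make the alternative concrete, here is a minimal sketch of one X-macro list driving both the enum and the two name lookups. Everything in it is an assumption for illustration: `EXAMPLE_VALUE_TYPE_MAPPING`, its three entries, and the `by_class`/`by_cql` helpers are hypothetical stand-ins, not the contents of cassandra.h. It uses `std::map` for brevity and builds the maps on first use.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical list: (enum name, numeric id, CQL name, class name).
#define EXAMPLE_VALUE_TYPE_MAPPING(XX) \
  XX(EXAMPLE_TYPE_ASCII, 0x0001, "ascii", "AsciiType") \
  XX(EXAMPLE_TYPE_BIGINT, 0x0002, "bigint", "LongType") \
  XX(EXAMPLE_TYPE_BOOLEAN, 0x0004, "boolean", "BooleanType")

// The enum is generated from the list.
enum ExampleValueType {
  EXAMPLE_TYPE_UNKNOWN = 0xFFFF,
#define XX_ENUM(name, id, cql, klass) name = id,
  EXAMPLE_VALUE_TYPE_MAPPING(XX_ENUM)
#undef XX_ENUM
  EXAMPLE_TYPE_LAST_ENTRY
};

typedef std::map<std::string, ExampleValueType> NameMap;

// Both maps are generated from the same list, so a new type is added once.
inline NameMap build_by_class() {
  NameMap m;
#define XX_BY_CLASS(name, id, cql, klass) m[klass] = name;
  EXAMPLE_VALUE_TYPE_MAPPING(XX_BY_CLASS)
#undef XX_BY_CLASS
  return m;
}

inline NameMap build_by_cql() {
  NameMap m;
#define XX_BY_CQL(name, id, cql, klass) m[cql] = name;
  EXAMPLE_VALUE_TYPE_MAPPING(XX_BY_CQL)
#undef XX_BY_CQL
  return m;
}

inline ExampleValueType lookup(const NameMap& m, const std::string& key) {
  NameMap::const_iterator it = m.find(key);
  return it == m.end() ? EXAMPLE_TYPE_UNKNOWN : it->second;
}

// Maps are built on the first call, not during static initialization.
inline ExampleValueType by_class(const std::string& klass) {
  static const NameMap map = build_by_class();
  return lookup(map, klass);
}

inline ExampleValueType by_cql(const std::string& cql) {
  static const NameMap map = build_by_cql();
  return lookup(map, cql);
}
```

The trade-off is the one raised above: lookups are O(log n) and the map nodes live on the heap.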

stamhankar999 · 2017-01-21T01:28:51Z

src/data_type.cpp

+  DataTypeInitializer() {
+    // Add a reference so that the memory is never deleted
+#define XX_VALUE_TYPE(name, type, cql) (new(&DataType::native_types_[name]) DataType(name))->inc_ref();
+    CASS_VALUE_TYPE_MAPPING(XX_VALUE_TYPE)


Since you're instantiating into the class array (which is already allocated), under what circumstance do you have to worry about the memory being deallocated? Process exit?

This doesn't allocate heap memory. This is a "placement new".

Yeah, I get that this is "placement new", but I don't understand why the inc_ref is needed.

It keeps the reference count from ever reaching zero, which matters because the memory is statically allocated. If the reference count reached zero, the object would be freed using delete, which would be incorrect.
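A minimal sketch of the idea, assuming a hypothetical intrusive ref-counted class `Counted` (not the driver's own ref-counting class): objects are constructed into static storage with placement new, and one extra reference pins the count at 1 or above so the "delete at zero" path is never taken.

```cpp
#include <cassert>
#include <new>

// Hypothetical intrusive ref-counted type, for illustration only.
class Counted {
public:
  explicit Counted(int id) : id_(id), refs_(0) {}
  void inc_ref() { ++refs_; }
  // Returns true when the count reaches zero, which is the point where a
  // real implementation would call `delete this`.
  bool dec_ref() { return --refs_ == 0; }
  int id() const { return id_; }
  int refs() const { return refs_; }
private:
  int id_;
  int refs_;
};

// Statically allocated, suitably aligned storage for three objects.
alignas(Counted) static unsigned char storage[3 * sizeof(Counted)];
static Counted* slots[3];

inline void construct_all() {
  for (int i = 0; i < 3; ++i) {
    // Placement new constructs in the existing storage; no heap allocation.
    Counted* obj = new (storage + i * sizeof(Counted)) Counted(i);
    // The extra reference is never released, so the count stays >= 1.
    obj->inc_ref();
    slots[i] = obj;
  }
}
```

Any smart pointer that later takes and drops a reference moves the count between 2 and 1, never to 0.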

stamhankar999 · 2017-01-21T01:32:07Z

src/data_type.cpp

+  ValueTypeByClassMapping* mapping =
+    ValueTypeByClass::in_word_set(name.c_str(), name.length());
+  if (mapping == NULL) return DataType::NIL;
+  return DataType::ConstPtr(&native_types_[mapping->value_type]);


Does the above inc_ref keep the ConstPtr from deallocating the array's memory (when the ConstPtr eventually is destroyed)?

stamhankar999 · 2017-01-21T02:01:42Z

src/generate_value_types.sh

+#!/bin/sh
+# 'register' is a deprecated keyword in C++1x
+gperf -t value_types_by_class.gperf | sed 's/register //g' > value_types_by_class.hpp
+gperf -t value_types_by_cql.gperf | sed 's/register //g' > value_types_by_cql.hpp


If a user calls the script from outside of the src dir, the generated files won't go to the src directory. How about something like this (untested):

DIRNAME=`dirname "$0"`
for i in value_types_by_class value_types_by_cql ; do
  gperf -t "$DIRNAME/$i.gperf" | sed 's/register //g' > "$DIRNAME/$i.hpp"
done

No one should be insane enough to use spaces in their directory paths, but the extra quoting above handles it if they do.

I disagree with the "no spaces in directory paths" point; Windows still loves spaces, and a good number of those users have spaces in their paths, especially when their user directory is their full name.

These are my concerns:

1. Using a shell script to generate these files requires that we also make a batch file for Windows; I would prefer that we use CMake to handle the generation instead.

2. Committing generated files could lead to a situation where we forget to regenerate the files and commit them; granted, the chances are low.

3. Not making this an automated step during the build process turns this into a manual process when we should probably automate it.

2) and 3) are kind of the same, and I am not 100% against keeping this generation manual and committing the files when required, as these files will change extremely rarely.

mpenick · 2017-01-23T13:37:15Z

I'm not concerned about O(log(n)) vs O(1). I'm trying to avoid heap allocation in the static context. We did this a bunch when we first released the driver and had clients that ran into issues. I'm trying to remember what the issues were (I'll get back to you when I remember or find them). At first I wasn't even going to use static initialization (because this can have issues too). I was going to use a pattern similar to SSL's initialization, but this seems to be a simple enough use case and doesn't involve any heap allocation.

mpenick · 2017-01-23T13:49:34Z

Side note: We could have moved std::map to dense_hash_map for O(1) runtime complexity. This is something I highly recommend for code moving forward, but not in static initialization because it allocates heap memory. In almost every case where you would use std::map, it makes sense to favor dense_hash_map, as long as you can find a suitable empty key and/or deleted key.

mpenick · 2017-01-23T14:14:57Z

I remember the issue. Allocating heap memory in static initialization prevents users from allocating that memory using their custom heap allocator. Static initialization usually happens when the library is loaded, before the client application has had a chance to run, so the client application is unable to set their allocator for memory allocated during that prior static initialization.
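A minimal sketch of the ordering problem, under stated assumptions: `set_allocator`, `current_malloc`, and `EagerTable` are hypothetical names, not the driver's API. The global's constructor runs during static initialization, so its allocation goes through the default function before the application has any chance to install its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

typedef void* (*MallocFn)(std::size_t);

static int default_calls = 0;
static int custom_calls = 0;

inline void* default_malloc(std::size_t n) { ++default_calls; return std::malloc(n); }
inline void* custom_malloc(std::size_t n) { ++custom_calls; return std::malloc(n); }

// Hypothetical library-wide allocator hook.
static MallocFn current_malloc = default_malloc;
inline void set_allocator(MallocFn fn) { current_malloc = fn; }

// A global whose constructor allocates. It is initialized before main(),
// so it always sees the default allocator.
struct EagerTable {
  EagerTable() : data(current_malloc(64)) {}
  ~EagerTable() { std::free(data); }
  void* data;
};
static EagerTable eager_table;
```

By the time the application calls `set_allocator(custom_malloc)`, `default_calls` is already 1; only allocations made afterwards reach the custom function.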

Custom allocation is something we plan to support, in short order, because it has been requested by several major users. The issue: https://datastax-oss.atlassian.net/browse/CPP-360

mpenick · 2017-01-23T14:23:35Z

I agree, it's not great having that same information duplicated in a few files. Maybe those .gperf files can be generated on-the-fly, as an intermediate step, using the information from cassandra.h?

mpenick · 2017-01-23T19:06:50Z

Heh, since you both gave it a thumbs up, do you want to write the generator? :)

stamhankar999 · 2017-01-24T17:28:21Z

src/data_type.cpp

+
+DataType::ConstPtr SimpleDataTypeCache::by_value_type(uint16_t value_type) {
+  if (value_type == CASS_VALUE_TYPE_UNKNOWN ||
+      value_type <= CASS_VALUE_TYPE_LAST_ENTRY) {


Shouldn't this be >=?
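For reference, a small sketch of the check with `>=`, using hypothetical constants in place of the real CASS_VALUE_TYPE_* values. A table sized by the last-entry sentinel has valid indexes 0 through LAST_ENTRY - 1, so the rejection test has to catch everything at or above the sentinel.

```cpp
#include <cassert>
#include <stdint.h>

// Hypothetical values; the real constants are defined in cassandra.h.
const uint16_t EXAMPLE_VALUE_TYPE_UNKNOWN = 0xFFFF;
const uint16_t EXAMPLE_VALUE_TYPE_LAST_ENTRY = 0x0031;

// True when value_type must not be used to index a table of
// EXAMPLE_VALUE_TYPE_LAST_ENTRY elements.
inline bool is_out_of_range(uint16_t value_type) {
  return value_type == EXAMPLE_VALUE_TYPE_UNKNOWN ||
         value_type >= EXAMPLE_VALUE_TYPE_LAST_ENTRY;
}
```

With `<=` instead, every valid index would be rejected and every out-of-range value except the unknown marker would be accepted.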

stamhankar999 · 2017-01-24T17:31:35Z

src/data_type.hpp

+// etc) so these data types don't need to be allocated/deallocated over and over.
+// `DataType` is reference counted so it could lead to mulitple threads modifying
+// a shared reference count. To mitigate this sharing, thread IDs are used
+// to distribute mulitple copies of the same data type.


mulitple => multiple

Love the comment blocks here and above.

* Made the string to type mapping for C* types more generalized (it was only meant to be used by schema metadata) and faster using a statically allocated hash table.
* Realized that using `CASS_<type>_MAP()` was bad naming as "MAP" could refer to a "map" data/value type. Changed it to `CASS_<type>_MAPPING()`. Backwards compatible macros were added to avoid breaking existing applications.
* Realized `FixedVector` was bad naming. `SmallVector` better describes its purpose.

Fixed typo.

mpenick · 2017-01-27T18:16:21Z

This is finally ready to review again. This iteration doesn't require having value type information in multiple places and is simpler. It uses dense_hash_map with a static allocator, so it's a hybrid between the two approaches.

stamhankar999 · 2017-01-30T19:41:14Z

src/data_type.cpp

+  }
+  DataType::ConstPtr& data_type = cache_[value_type];
+  if (!data_type) {
+    data_type = DataType::ConstPtr(


Shouldn't cache_ be updated here?

It's a reference.

Oh, duh! Makes sense.
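A small sketch of why no explicit update is needed, using `std::map` and `std::shared_ptr` as stand-ins (assumptions; the real `cache_` and `DataType::ConstPtr` are the driver's own types): `operator[]` returns a reference to the stored slot, so assigning through it writes into the container.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>

typedef std::shared_ptr<const int> ConstPtr;  // stand-in for DataType::ConstPtr

class ExampleCache {
public:
  const ConstPtr& get(int key) {
    // operator[] inserts an empty pointer if the key is missing and
    // returns a reference to the element stored in the map.
    ConstPtr& slot = cache_[key];
    if (!slot) {
      // Assigning through the reference updates the map entry itself.
      slot = ConstPtr(new int(key * 10));
    }
    return slot;
  }
  std::size_t size() const { return cache_.size(); }
private:
  std::map<int, ConstPtr> cache_;
};
```

A second `get` with the same key finds the populated slot and returns the same object.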

stamhankar999 · 2017-01-30T19:49:31Z

src/data_type.hpp

+
+class SimpleDataTypeCache {
+public:
+  const DataType::ConstPtr& by_class(StringRef name) {


Why not build up this cache at the beginning of time similar to ValueTypes and make these methods static (and make cache_ static)? Then you wouldn't need to pass a SimpleDataTypeCache to various methods below...

I'm really concerned about sharing reference counted objects on multiple threads. With this method we get some sharing without causing multiple cores to thrash on those shared reference counts.

Do we need reference counting for this? What if it's a singleton initialized at the beginning of time and consumers get the elements they need? The consumers don't even try to delete what they've got, and when the application terminates, the singleton will be destroyed then. by_class and similar methods could return a const DataType& or const DataType*; the former would make it less likely that a consumer would try to destroy the DataType.

The problem is the external API and other places in the code treat more complex and unique data types (collections, UDTs, tuples, etc.) as generic DataType. Even if we return these as const DataType& or const DataType* they're going to eventually be put into a DataType::ConstPtr which is going to modify the shared reference count.

stamhankar999 · 2017-01-30T21:41:56Z

src/small_dense_hash_map.hpp

+// which can be found at the bottom of "sparesehash/internal/densehashtable.h".
+#define OCCUPANCY_PCT 50
+
+#define MIN_BUCKETS(N) STATIC_NEXT_POW_2((((N * 100) / OCCUPANCY_PCT) + 1))


What is the reasoning behind this calculation of MIN_BUCKETS?

This is to statically allocate enough buckets to hold N items without requiring additional allocations. A hash table doesn't just allocate N buckets to hold N items because of hash collisions so it allocates a number of buckets over the current number of items (controlled by a load factor) to compensate for that.

OCCUPANCY_PCT is dense_hash_map's load factor and it needs the number of buckets to be a power of two because it uses a mask and & to wrap around instead of using modulo (which is expensive).

This calculates the number of buckets considering the load factor: (N * 100) / OCCUPANCY_PCT (The + 1 is to allocate at least one bucket) then it rounds up to the next power of two for the wrap around trick.

Coolness. Makes sense. The key thing I was missing was that the wrap around is done by mask and requires num-buckets to be a power of 2.
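The arithmetic can be sketched with constexpr functions. This is an illustration under assumptions, not the macro from the patch: `next_pow_2` here returns the smallest power of two greater than or equal to its argument, which may differ in detail from STATIC_NEXT_POW_2, and `bucket_for` is a hypothetical helper showing the mask trick.

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two >= n (assumed behavior; see the note above).
constexpr std::size_t next_pow_2(std::size_t n, std::size_t p = 1) {
  return p >= n ? p : next_pow_2(n, p * 2);
}

// Load factor as a percentage, mirroring OCCUPANCY_PCT.
constexpr std::size_t kOccupancyPct = 50;

// Buckets needed to hold n items at the given load factor, rounded up
// to a power of two.
constexpr std::size_t min_buckets(std::size_t n) {
  return next_pow_2((n * 100) / kOccupancyPct + 1);
}

// With a power-of-two bucket count, masking gives the same result as modulo.
constexpr std::size_t bucket_for(std::size_t hash, std::size_t buckets) {
  return hash & (buckets - 1);
}

// 16 items at 50% occupancy: 16 * 100 / 50 + 1 = 33, rounded up to 64.
static_assert(min_buckets(16) == 64, "16 items need 64 buckets");
static_assert(bucket_for(130, 64) == 130 % 64, "mask matches modulo");
```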

stamhankar999 · 2017-01-30T22:02:50Z

src/small_dense_hash_map.hpp

+  typedef sparsehash::dense_hash_map<K, V, HashFcn, EqualKey, Allocator> HashMap;
+
+  SmallDenseHashMap()
+    : HashMap(N, typename HashMap::hasher(), typename HashMap::key_equal(), Allocator(&fixed_)) {


Why use typename here but not in the other constructor? Also, aren't these actual values, not types, so we shouldn't have typename here?

Fixing compiler errors. Sometimes C++ compilers get confused and need to be told that a name refers to a type. Perhaps it should be added to both, and the second constructor didn't error because it's not currently being used?

Also, this is parsed as (typename HashMap::hasher)() not typename (HashMap::hasher()).

Ohhh... that does make more sense.
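A self-contained sketch of the same construct, with hypothetical `Table` and `SmallTable` templates standing in for dense_hash_map and SmallDenseHashMap. Inside a template, `Base::hasher` is a dependent name, so `typename` marks it as a type; the trailing `()` then value-initializes a temporary of that type.

```cpp
#include <cassert>
#include <functional>

// Hypothetical container exposing nested functor types.
template <class K>
struct Table {
  typedef std::hash<K> hasher;
  typedef std::equal_to<K> key_equal;
  Table(hasher h, key_equal eq) : hash_(h), eq_(eq) {}
  hasher hash_;
  key_equal eq_;
};

template <class K>
struct SmallTable : Table<K> {
  typedef Table<K> Base;
  // `typename Base::hasher()` parses as (typename Base::hasher)():
  // the keyword applies to the nested name, and () creates a temporary.
  SmallTable() : Base(typename Base::hasher(), typename Base::key_equal()) {}
};
```

Outside a template (for example with `Table<int>::hasher()`), the keyword is not needed because the name is not dependent.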

stamhankar999 · 2017-01-30T22:15:33Z

test/unit_tests/src/test_data_type.cpp

+  BOOST_CHECK(strlen(klass) == 0 ||                       \
+              cass::ValueTypes::by_class(klass) == name); \
+
+  CASS_VALUE_TYPE_MAPPING(XX_VALUE_TYPE)


Very neat use of the macro!

stamhankar999

I think all questions have been resolved, except about the ref counting in SimpleDataTypeCache.

mpenick · 2017-02-01T17:08:51Z

Merged at 3c55e33

stamhankar999 reviewed Jan 21, 2017

stamhankar999 reviewed Jan 24, 2017

mpenick force-pushed the 428 branch 5 times, most recently from 518d8fc to bd11338 Compare January 27, 2017 16:17

mpenick force-pushed the 428 branch from bd11338 to 171a9ef Compare January 27, 2017 16:42

428 - Improved C* string to type mapping

ff14364

Fixed typo.

stamhankar999 reviewed Jan 30, 2017

stamhankar999 reviewed Jan 31, 2017

mpenick closed this Feb 1, 2017

mikefero deleted the 428 branch February 3, 2017 21:24

mpenick pushed a commit that referenced this pull request Dec 10, 2019

CPP-859 - Remove vc_build.bat scripts and update building documentati…

d164080

…on (#334)
