
Conversation

@mpenick mpenick commented Jan 20, 2017

No description provided.

@stamhankar999
This looks great! My only gripe is that when adding a new type to the system, we have to update three files: the CASS_VALUE_TYPE_MAPPING in cassandra.h and the two .gperf files. The former actually has some of the same info as the value_types_by_cql.gperf file.

Consider this alternative:

  • Add the class-names in CASS_VALUE_TYPE_MAPPING
  • Just as we have the DataType::native_types_ class variable, add a map variable for mapping class-names to DataType's (or the enum value), and add a map variable for mapping cql-type-names to DataType's.
  • Have the DataTypeInitializer initialize the maps

If it's an STL map, we'll have log(n) access, which is, of course, worse than the O(1) hash access provided by gperf. If that's a concern, maybe there's a general hash-map impl we can leverage. It may not be perfect hashing, but I think that would be good enough.

DataTypeInitializer() {
// Add a reference so that the memory is never deleted
#define XX_VALUE_TYPE(name, type, cql) (new(&DataType::native_types_[name]) DataType(name))->inc_ref();
CASS_VALUE_TYPE_MAPPING(XX_VALUE_TYPE)
Contributor

Since you're instantiating into the class array (which is already allocated), under what circumstance do you have to worry about the memory being deallocated? Process exit?

@mpenick mpenick Jan 23, 2017

This doesn't allocate heap memory. This is a "placement new".

Contributor

Yeah, I get that this is "placement new", but I don't understand why the inc_ref is needed.

@mpenick mpenick Jan 23, 2017

It keeps the reference count from ever reaching zero, because the memory is statically allocated. If the reference count reached zero, the object would be freed with delete, which would be incorrect.

ValueTypeByClassMapping* mapping =
ValueTypeByClass::in_word_set(name.c_str(), name.length());
if (mapping == NULL) return DataType::NIL;
return DataType::ConstPtr(&native_types_[mapping->value_type]);
Contributor

Does the above inc_ref keep the ConstPtr from deallocating the array's memory (when the ConstPtr eventually is destroyed)?

Contributor Author

Yes.

#!/bin/sh
# 'register' is a deprecated keyword in C++1x
gperf -t value_types_by_class.gperf | sed 's/register //g' > value_types_by_class.hpp
gperf -t value_types_by_cql.gperf | sed 's/register //g' > value_types_by_cql.hpp
Contributor

If a user calls the script from outside of the src dir, the generated files won't go to the src directory. How about something like this (untested):

DIRNAME=`dirname "$0"`
for i in value_types_by_class value_types_by_cql ; do
  gperf -t "$DIRNAME/$i.gperf" | sed 's/register //g' > "$DIRNAME/$i.hpp"
done

No one should be insane enough to use spaces in their directory paths, but the extra quoting above handles it if they do.

Contributor

I disagree about no spaces in directory paths; Windows still loves spaces, and a good number of those users still use spaces in their paths, especially when their user directory is their full name, which may contain spaces.

These are my concerns:

  1. Using a shell script to generate these files requires that we also make a batch file for Windows; would prefer that we utilize CMake to handle the generation instead
  2. Committing generated files could lead to a situation where we have forgotten to manually generate the files and commit; granted chances are low
  3. Not making this an automated step during the build process turns this into a manual process when we should probably automate it
Items 1 and 3 are kind of the same, and I am not 100% against keeping this generation manual and committing the files when required, as the number of times these will change is extremely low.

mpenick commented Jan 23, 2017

I'm not concerned about O(log(n)) vs. O(1). I'm trying to avoid heap allocation being done in the static context. We did this a bunch when we first released the driver and had clients that had issues. I'm trying to remember what the issues were (I'll get back to you when I remember or find the issues). At first I wasn't even going to use static initialization (because it can have issues too). I was going to use a pattern similar to SSL's initialization, but this seems to be a simple enough use case and doesn't require any heap allocation.

mpenick commented Jan 23, 2017

Side note: We could have moved std::map to dense_hash_map for O(1) runtime complexity. This is something I highly recommend for code moving forward, but not in static initialization, because it allocates heap memory. It makes sense in almost every case where you would use std::map to favor dense_hash_map, as long as you can find a suitable empty key and/or deleted key.

mpenick commented Jan 23, 2017

I remember the issue. Allocating heap memory in static initialization prevents users from allocating that memory using their custom heap allocator. Static initialization usually happens when the library is loaded, before the client application has had a chance to run, so the client application is unable to set their allocator for memory allocated during that prior static initialization.

Custom allocation is something we plan to support, in short order, because it has been requested by several major users. The issue: https://datastax-oss.atlassian.net/browse/CPP-360

mpenick commented Jan 23, 2017

I agree, it's not great having that same information duplicated in a few files. Maybe those .gperf files can be generated on-the-fly, as an intermediate step, using the information from cassandra.h?

mpenick commented Jan 23, 2017

Heh, since you both thumbs-upped it, do you want to write the generator? :)


DataType::ConstPtr SimpleDataTypeCache::by_value_type(uint16_t value_type) {
if (value_type == CASS_VALUE_TYPE_UNKNOWN ||
value_type <= CASS_VALUE_TYPE_LAST_ENTRY) {
Contributor

Shouldn't this be >=?

// etc) so these data types don't need to be allocated/deallocated over and over.
// `DataType` is reference counted so it could lead to mulitple threads modifying
// a shared reference count. To mitigate this sharing, thread IDs are used
// to distribute mulitple copies of the same data type.
Contributor

mulitple => multiple

Love the comment blocks here and above.

@mpenick mpenick force-pushed the 428 branch 5 times, most recently from 518d8fc to bd11338 on January 27, 2017 at 16:17
* Made the string to type mapping for C* types more generalized
  (it was only meant to be used by schema metadata) and faster using
  a statically allocated hash table.
* Realized that using `CASS_<type>_MAP()` was bad naming as "MAP" could
  refer to a "map" data/value type. Changed it to `CASS_<type>_MAPPING()`.
  Backwards compatible macros were added to avoid breaking existing
  applications.
* Realized `FixedVector` was bad naming. `SmallVector` better describes
  its purpose.
mpenick commented Jan 27, 2017

This is finally ready to review again. This iteration doesn't require having value type information in multiple places and is simpler. This uses dense_hash_map with a static allocator, so it's a hybrid between the two approaches.

}
DataType::ConstPtr& data_type = cache_[value_type];
if (!data_type) {
data_type = DataType::ConstPtr(
Contributor

Shouldn't cache_ be updated here?

Contributor Author

It's a reference.

Contributor

Oh, duh! Makes sense.


class SimpleDataTypeCache {
public:
const DataType::ConstPtr& by_class(StringRef name) {
Contributor

Why not build up this cache at the beginning of time similar to ValueTypes and make these methods static (and make cache_ static)? Then you wouldn't need to pass a SimpleDataTypeCache to various methods below...

@mpenick mpenick Jan 31, 2017

I'm really concerned about sharing reference counted objects on multiple threads. With this method we get some sharing without causing multiple cores to thrash on those shared reference counts.

Contributor

Do we need reference counting for this? What if it's a singleton initialized at the beginning of time and consumers get the elements they need? The consumers don't even try to delete what they've got, and when the application terminates, the singleton will be destroyed then. by_class and similar methods could return a const DataType& or const DataType*; the former would make it less likely that a consumer would try to destroy the DataType.

@mpenick mpenick Jan 31, 2017

The problem is the external API and other places in the code treat more complex and unique data types (collections, UDTs, tuples, etc.) as generic DataType. Even if we return these as const DataType& or const DataType* they're going to eventually be put into a DataType::ConstPtr which is going to modify the shared reference count.

// which can be found at the bottom of "sparesehash/internal/densehashtable.h".
#define OCCUPANCY_PCT 50

#define MIN_BUCKETS(N) STATIC_NEXT_POW_2((((N * 100) / OCCUPANCY_PCT) + 1))
Contributor

What is the reasoning behind this calculation of MIN_BUCKETS?

Contributor Author

This is to statically allocate enough buckets to hold N items without requiring additional allocations. A hash table doesn't just allocate N buckets to hold N items because of hash collisions so it allocates a number of buckets over the current number of items (controlled by a load factor) to compensate for that.

OCCUPANCY_PCT is dense_hash_map's load factor and it needs the number of buckets to be a power of two because it uses a mask and & to wrap around instead of using modulo (which is expensive).

This calculates the number of buckets accounting for the load factor: (N * 100) / OCCUPANCY_PCT (the + 1 is to allocate at least one bucket), then rounds up to the next power of two for the wrap-around trick.

Contributor

Coolness. Makes sense. The key thing I was missing was that the wrap-around is done by mask and requires the number of buckets to be a power of 2.

typedef sparsehash::dense_hash_map<K, V, HashFcn, EqualKey, Allocator> HashMap;

SmallDenseHashMap()
: HashMap(N, typename HashMap::hasher(), typename HashMap::key_equal(), Allocator(&fixed_)) {
Contributor

Why use typename here but not in the other constructor? Also, aren't these actual values, not types, so we shouldn't have typename here?

Contributor Author

Fixing compiler errors. Sometimes C++ compilers get confused and need to be told that a name refers to a type. Perhaps it should be added to both, and the second constructor didn't error because it isn't currently being used?

@mpenick mpenick Jan 31, 2017

Also, this is parsed as (typename HashMap::hasher)() not typename (HashMap::hasher()).

Contributor

Ohhh... that does make more sense.

BOOST_CHECK(strlen(klass) == 0 || \
cass::ValueTypes::by_class(klass) == name); \

CASS_VALUE_TYPE_MAPPING(XX_VALUE_TYPE)
Contributor

Very neat use of the macro!

@stamhankar999 stamhankar999 left a comment

I think all questions have been resolved, except about the ref counting in SimpleDataTypeCache.

mpenick commented Feb 1, 2017

Merged at 3c55e33

@mpenick mpenick closed this Feb 1, 2017
@mikefero mikefero deleted the 428 branch February 3, 2017 21:24