[READY] Use struct of arrays for static code_points object #1140
Conversation
Codecov Report
```
@@            Coverage Diff            @@
##           master    #1140     +/-   ##
=========================================
+ Coverage   97.69%   97.69%   +<.01%
=========================================
  Files          90       90
  Lines        7058     7069      +11
=========================================
+ Hits         6895     6906      +11
  Misses        163      163
```
Force-pushed from 9342a56 to 0cf0422.
Thanks for the PR. To be thorough, I compared the time it takes to compile the library and the resulting size with and without the proposed changes on platforms we support:
| Platform | Compilation time before (s) | Compilation time after (s) | Library size before (MB) | Library size after (MB) |
|---|---|---|---|---|
| Ubuntu 18.04 64-bit (GCC 7.3.0) | 26.56 | 25.79 | 19.59 | 7.32 |
| macOS 10.14 (Apple Clang 10.0.0) | 29.16 | 28.20 | 7.09 | 7.30 |
| Windows 7 64-bit (MSVC 15) | 30.62 | 30.39 | 7.86 | 6.99 |
The results were obtained by running `time ./build.py --no-regex`. There is a small decrease in compilation time on all platforms, a huge size reduction on Linux (as expected), a non-negligible decrease on Windows, and a minor increase on macOS.
Reviewed 2 of 3 files at r1, 1 of 1 files at r2, 1 of 1 files at r3.
Reviewable status: 1 of 2 LGTMs obtained
Reviewable status: 1 of 2 LGTMs obtained
update_unicode.py, line 51 at r3 (raw file):
```
std::array< char[{folded_case_size}], {size} > folded_case;
std::array< char[{swapped_case_size}], {size} > swapped_case;
std::array< bool, {size} > is_letter;
```
I wonder if we should really use a bitset here. A `std::array` of bools has a large overhead of unused bits, as each bool occupies a full byte. For an array of 10,000 bools:
```
bash-3.2$ ./array
Array:  Size: 10000
Bitset: Size: 1256
```
Possibly an early optimisation, but this would also give better cache coherency as well as saving padding.
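The `./array` program itself isn't quoted in the thread, but the space argument is easy to reproduce. Here is a rough Python analogue (the `pack_bits` helper is hypothetical, not code from this PR) that packs eight flags per byte, the way `std::bitset` stores them, instead of one full byte per flag as `std::array< bool, N >` does:

```python
def pack_bits( flags ):
  # Pack a sequence of booleans into a bytearray, eight flags per byte,
  # mirroring how std::bitset stores one bit per flag.
  packed = bytearray( ( len( flags ) + 7 ) // 8 )
  for index, flag in enumerate( flags ):
    if flag:
      packed[ index // 8 ] |= 1 << ( index % 8 )
  return packed


flags = [ index % 3 == 0 for index in range( 10000 ) ]
print( len( flags ) )               # 10000 elements, one byte each as bools
print( len( pack_bits( flags ) ) )  # 1250 bytes when bit-packed
```

The measured 1256 bytes for `std::bitset` above is consistent with 1250 bytes of payload plus a little object overhead.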
Reviewable status: 1 of 2 LGTMs obtained
update_unicode.py, line 501 at r3 (raw file):
def CppLength( utf8_code_point ):
I think this could benefit from a comment explaining what it does. It would seem to me that the `len( bytes() )` representation is canonical, and splitting based on some indicator seems... dodgy ?
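The actual implementation of `CppLength` isn't quoted in the thread; as context for the exchange, here is a hedged guess at what such a helper might compute, assuming the generated C++ stores each code point in a fixed-size char array with a terminating NUL:

```python
def CppLength( utf8_code_point ):
  # Hypothetical sketch only, not the real update_unicode.py code:
  # the length of the code point's UTF-8 encoding, plus one char for the
  # terminating NUL of the generated fixed-size char array.
  return len( utf8_code_point.encode( 'utf8' ) ) + 1
```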
update_unicode.py, line 512 at r3 (raw file):
```
size = len( code_points )
original_table = '{{'
original_size = 0
```
This section looks dense and liable to errors when changed. Could we present it as generalised data, then run an algorithm on it? Something like:
```python
table = {
  'original': { 'output': '{{', 'size': 0, 'converter': CppChar },
  'normal': { ... },
  'is_letter': { 'output': '{{', 'converter': CppBool }
}

for t, d in iteritems( table ):
  entry = code_point[ t ]
  d[ 'output' ] += d[ 'converter' ]( entry ) + ','
  d[ 'size' ] += max( CppLength( entry ), d[ 'size' ] )
```
Does that a) work, b) seem easier to maintain ?
I think it would be easier to review, as this is currently quite repetitive, which can lead to difficulty in spotting errors.
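For what it's worth, the proposed loop does run once the placeholders are filled in. A minimal self-contained sketch — the `CppChar`/`CppBool` converters and the sample data below are hypothetical, and the size update is written as an assignment rather than the sketch's `+=`:

```python
def CppChar( value ):
  # Hypothetical converter: quote a value for a C++ char array literal.
  return '"{}"'.format( value )


def CppBool( value ):
  return 'true' if value else 'false'


table = {
  'original': { 'output': '{{', 'size': 0, 'converter': CppChar },
  'is_letter': { 'output': '{{', 'size': 0, 'converter': CppBool },
}

code_points = [
  { 'original': 'a', 'is_letter': True },
  { 'original': '!', 'is_letter': False },
]

for code_point in code_points:
  for t, d in table.items():
    entry = code_point[ t ]
    # Append the converted entry; the trailing comma separates elements.
    d[ 'output' ] += d[ 'converter' ]( entry ) + ','
    # Track the widest entry seen so far for sizing the char arrays.
    d[ 'size' ] = max( len( str( entry ) ), d[ 'size' ] )
```

After the loop, `table[ 'original' ][ 'output' ]` holds `{{"a","!",` and `table[ 'is_letter' ][ 'output' ]` holds `{{true,false,`, ready for the closing-brace pass.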
Introduce RawCodePointArray for FindCodePoint to iterate over a contiguous array instead of an array of structs. This also allows the arrays in RawCodePointArray to use char[] instead of const char* and thus avoids 12 MB of relocation data on Linux.
Reviewable status: 1 of 2 LGTMs obtained
update_unicode.py, line 51 at r3 (raw file):
Previously, puremourning (Ben Jackson) wrote…
I wonder if we should really use a bitset here. A std::array of bools is a large overhead of unused bits. For an array of 10,000 bools:
```
bash-3.2$ ./array
Array:  Size: 10000
Bitset: Size: 1256
```
Possible early optimisation, but this would also have better cache coherency as well as saved padding.
It's not 10.000, it's 132.624. Yes, I used `.` to separate thousands, sue me! I'm trying it out.
update_unicode.py, line 501 at r3 (raw file):
Previously, puremourning (Ben Jackson) wrote…
I think this could benefit from a comment explaining what it does. It would seem to me that the len( bytes() ) representation is canonical and splitting based on some indicator seems... dodgy ?
Done, though wording might not be the best. Tell me if you want it reworded.
update_unicode.py, line 512 at r3 (raw file):
Previously, puremourning (Ben Jackson) wrote…
This section looks dense and liable to errors when changing. Could we present it as generalised data, then run an algorithm on it. Something like
```python
table = {
  'original': { 'output': '{{', 'size': 0, 'converter': CppChar },
  'normal': { ... },
  'is_letter': { 'output': '{{', 'converter': CppBool }
}

for t, d in iteritems( table ):
  entry = code_point[ t ]
  d[ 'output' ] += d[ 'converter' ]( entry ) + ','
  d[ 'size' ] += max( CppLength( entry ), d[ 'size' ] )
```
Does that a) work, b) seem easier to maintain ?
I think it would be easier to review, as this is currently quite repetitive, which can lead to difficulty in spotting errors.
Yes, it works and it is much nicer to look at.
I've used `table.items()` for two reasons: a) I didn't want to pull in the python-future dependency, b) the script was already Python 3 only.
On the other hand, I think this makes the script take a little longer to execute.
Reviewable status: 1 of 2 LGTMs obtained
update_unicode.py, line 538 at r4 (raw file):
```python
d[ 'output' ] = d[ 'output' ].rstrip( ',' ) + '}},'
if t == 'combining_class':
  d[ 'output' ] = d[ 'output' ].rstrip( ',' )
```
will this ever do anything, as you previously added '}}' to the end ?
Reviewed 1 of 3 files at r1, 1 of 1 files at r3, 1 of 1 files at r4.
Reviewable status: complete! 2 of 2 LGTMs obtained
Reviewable status: complete! 2 of 2 LGTMs obtained
update_unicode.py, line 51 at r3 (raw file):
Previously, bstaletic (Boris Staletic) wrote…
It's not 10.000, it's 132.624. Yes, I used `.` to separate thousands, sue me! I'm trying it out.
I could get `std::bitset` to compile but it broke tests. I vote for leaving this as is.
update_unicode.py, line 538 at r4 (raw file):
Previously, puremourning (Ben Jackson) wrote…
will this ever do anything, as you previously added '}}' to the end ?
I added `}},` because the arrays in the initializer list need to be comma separated. Then on this line I remove the trailing comma, because the last array isn't followed by anything and can't have a trailing comma. In other words, if you remove this line it won't generate valid C++.
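As a sanity check, the behaviour described above can be reduced to a tiny sketch — the `CloseArray` helper and its inputs are hypothetical, not code from update_unicode.py:

```python
def CloseArray( output, is_last ):
  # Replace the trailing element comma with the closing braces, then add a
  # comma so arrays in the initializer list are comma separated...
  output = output.rstrip( ',' ) + '}},'
  # ...unless this is the last array, which must not end in a comma for the
  # generated initializer list to be valid C++.
  if is_last:
    output = output.rstrip( ',' )
  return output


first = CloseArray( '{{"a","b",', is_last = False )   # '{{"a","b"}},'
last  = CloseArray( '{{true,false,', is_last = True ) # '{{true,false}}'
```

The first `rstrip` only removes the element comma before `}}`; the second is the one that keeps the final array free of a trailing comma.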
Reviewable status: complete! 2 of 2 LGTMs obtained
update_unicode.py, line 538 at r4 (raw file):
Previously, bstaletic (Boris Staletic) wrote…
I added `}},` because the arrays in the initializer list need to be comma separated. Then on this line I remove the trailing comma, because the last array isn't followed by anything and can't have a trailing comma. In other words, if you remove this line it won't generate valid C++.
erm,... go home ben, you're blind.
carry on.
Reviewable status: complete! 2 of 2 LGTMs obtained
@zzbot r=micbou
📌 Commit 046f37f has been approved by
This is another attempt at doing #1064.
☀️ Test successful - status-appveyor, status-travis
update_unicode.py, line 512 at r3 (raw file):
Previously, bstaletic (Boris Staletic) wrote…
A little? It now takes 3 minutes and 23 seconds instead of 12 seconds for me; 17 times longer. I don't think this was worth the change.
I didn't measure, but I do remember having sat through the script before. Now I switch to my browser, and I know it takes more than 2 minutes. Should we revert the changes in the script?