Functions and macros that make HPCC easier and faster to code in. Currently broken down into transform tools for data transformation and string tools for standard string tidying.
- Documentation
- Installation
- Important Notes
- Transform Tools
- String Tools
- Example Workflow
- Issues, Bugs, Comments?
The package's github is available at: https://github.com/OdinProAgrica/dapper
This package is released under GNU GPLv3 License: https://www.gnu.org/licenses/gpl-3.0.en.html
Dapper is available as an ECL bundle. Something that we only recently learned is that HPCC actually supports libraries in a
similar way to Python's pip. It isn't quite as fully featured but it works well enough. The basic idea is that it will
pull down a 'bundle' of ECL code from github and install it locally making it available as
if it was part of the core libraries. That is, you can IMPORT
it without having to have the scripts in your repo.
This tool is made available with ecl.exe (and its Linux equivalent). To install the latest dapper release you simply run the following on the command line (although I hear that the latest IDE has this baked in).
ecl bundle install -v https://github.com/OdinProAgrica/dapper.git
There are many more bundles available, but their coverage, testing and adherence to version control varies. HPCC keep a curated list here.
If you want a specific version use that version's branch, for details see the help in ecl bundle install.
Copy the dapper folder into an ECL repository (or add the folder to your IDE's environment), you can then import the relevant modules. You can get zips of each version in the releases section of the github: https://github.com/OdinProAgrica/dapper/releases
Do note that the standard nomenclature for importing the modules should be respected. This is because each module references its own functions, requiring a known import name. Modules should always be imported as:
- TransformTools: tt
- StringTools: st
For example:
IMPORT dapper.TransformTools as tt;
inDS and OutDS are reserved! You may find you get weird errors if you use these variable names, this will be fixed in a later release.
Once installed, dapper can be used to create scripts using simple verbs which can increase readability and decrease coding
mistakes. In short, it reduces the amount of time you spend thinking about how to do a job, you can just get on with it. Don’t
get me wrong, sometimes an old school PROJECT
is better. I leave it to the reader to decide what they like. As always,
right tools, right job. It is worth noting however that, even if you use several transformTools statements in a row, the
compiler is clever enough to combine this into a single operation under the bonnet, minimising dapper's speed impact.
The bundle itself is broken down into two sets of tools:
Okay, so I'm going to give you an ECL script which (assuming you have installed dapper using the bundle method described above) will work on your system (fingers crossed!).
IMPORT dapper.ExampleData;
IMPORT dapper.TransformTools as tt;
//load data
StarWars := ExampleData.starwars;
// Look at the data
tt.nrows(StarWars);
tt.head(StarWars);
//Fill blank species with unknown
fillblankHome := tt.mutate(StarWars, species, IF(species = '', 'Unkn.', species));
tt.head(fillblankHome);
//Create a BMI for each character
bmi := tt.append(fillblankHome, REAL, BMI, mass/height^2);
tt.head(bmi);
//Find the highest
sortedBMI := tt.arrange(bmi, '-bmi');
tt.head(sortedBMI);
//Jabba should probably go on a diet.
//How many of each species are there?
species := tt.countn(sortedBMI, 'species');
sortedspecies := tt.arrange(species, '-n');
tt.head(sortedspecies);
//Finally let's look at unique hair/eye colour combinations:
colourData := tt.select(StarWars, 'hair_color, eye_color');
unqiueColours := tt.distinct(colourData, 'hair_color, eye_color');
//see arrangedistinct() for fancy sort/dedup
tt.head(unqiueColours);
//and save our results
tt.to_csv(sortedBMI, 'ROB::TEMP::STARWARSCSV');
IMPORT dapper.ExampleData;
IMPORT dapper.TransformTools as tt;
//load data
StarWars := ExampleData.starwars;
Yes, I'm a nerd, what of it?
// Look at the data
tt.nrows(StarWars);
tt.head(StarWars);
These are shorthand for OUTPUT
and COUNT
however note the way the results are named. It automagically renames your results
to match your variables.
//Fill blank species with unknown
fillblankHome := tt.mutate(StarWars, species, IF(species = '', 'Unkn.', species));
tt.head(fillblankHome);
//Create a BMI for each character
bmi := tt.append(fillblankHome, REAL, BMI, mass/height^2);
tt.head(bmi);
//Find the highest
sortedBMI := tt.arrange(bmi, '-bmi');
tt.head(sortedBMI);
//Jabba should probably go on a diet.
Mutate and append allow column transforms via a simple formula. Note that there is no need for SELF
or LEFT
in these
transforms, making them easier to write and easier to read!
//How many of each species are there?
species := tt.countn(sortedBMI, 'species');
sortedspecies := tt.arrange(species, '-n');
tt.head(sortedspecies);
The record definition to do a cross-tab is something I always have to look up. countn
will do it for you, you can
even hand multiple columns to it and it'll handle them all perfectly.
//Finally let's look at unique hair/eye colour combinations:
colourData := tt.select(StarWars, 'hair_color, eye_color');
unqiueColours := tt.distinct(colourData, 'hair_color, eye_color');
//see arrangedistinct() for fancy sort/dedup
tt.head(unqiueColours);
select
(and it's partner function drop
) will help in quickly sub-setting data, note too the use of distinct which is
shorthand for DEDUP(SORT(DISTRIBUTE(...), LOCAL), LOCAL)
, using arrangedistinct()
instead allows you to control the sort
and distribute commands separately. See also duplicated
which will flag all duplicates with a boolean for investigation!
Also, sorry for the UK English, sometimes I can't help myself!
//and save our results
tt.to_csv(sortedBMI, 'ROB::TEMP::STARWARSCSV');
Finally we can write out. OUTPUT(...CSV(...))
is another one of those functions you always forget, this handles all the quote,
separator, header stuff for you and simply writes out a 'normal' csv file. It'll also add the tilde (~) to the start
of your file name if you forget.
That's a quick whistle stop tour of dapper's power and functionality. I hope you can see how useful it is for things like testing and investigative work, allowing a more logical flow and readable code.
Glad you asked, see below:
IMPORT dapper.ExampleData;
//load data
StarWars := ExampleData.starwars;
// Look at the data
OUTPUT(COUNT(StarWars), NAMED('COUNTstarWars'));
OUTPUT(StarWars, NAMED('starWars'));
//Fill blank species with unknown
//Create a BMI for each character
fillblankHomeAndBMI :=
PROJECT(StarWars,
TRANSFORM({RECORDOF(LEFT); REAL BMI;},
SELF.BMI := LEFT.mass / LEFT.Height^2;
SELF.species := IF(LEFT.species = '', 'Unkn.', LEFT.species);
SELF := LEFT;));
OUTPUT(fillblankHomeAndBMI, NAMED('fillblankHomeAndBMI'));
//Find the highest
sortedBMI := SORT(fillblankHomeAndBMI, -bmi);
OUTPUT(sortedBMI, NAMED('sortedBMI'));
//Jabba should probably go on a diet.
//How many of each species are there?
CountRec := RECORD
STRING Species := sortedBMI.species;
INTEGER n := COUNT(GROUP);
END;
species := TABLE(sortedBMI, CountRec, species);
sortedspecies := SORT(species, -n);
OUTPUT(sortedspecies, NAMED('sortedspecies'));
//Finally let's look at unique hair/eye colour combinations:
colourData := TABLE(sortedBMI, {hair_color, eye_color});
unqiueColours := DEDUP(SORT(DISTRIBUTE(colourData, HASH(hair_color)), hair_color, eye_color, LOCAL), hair_color, eye_color, LOCAL);
OUTPUT(COUNT(unqiueColours), NAMED('COUNTunqiueColours'));
OUTPUT(unqiueColours, NAMED('unqiueColours'));
//and save our results
OUTPUT(sortedBMI, , 'ROB::TEMP::STARWARSCSV', CSV(HEADING(SINGLE), SEPARATOR(','), TERMINATOR('\n'), QUOTE('"')));
If you do have any comments or suggestions please do checkout the github issues page here.