[HIVEMALL-145] Merge Brickhouse functions #135

Closed
wants to merge 56 commits into from

Member

myui commented Feb 27, 2018

What changes were proposed in this pull request?

Merge brickhouse functions.

What type of PR is it?

Feature

What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-145

How was this patch tested?

unit tests and manual tests

How to use this feature?

As described in the user guide.

Checklist

  • Did you apply source code formatter, i.e., mvn formatter:format, for your commit?
  • Did you run system tests on Hive (or Spark)?
  • Invite active/main Brickhouse developers as Hivemall PPMC members or committers.
    klout/brickhouse#149
  • +1 from Klout members to merge

myui commented Feb 27, 2018

Still WIP: I am reviewing which functions to merge.

myui force-pushed the myui:merge_brickhouse branch 2 times, most recently from a9e59ac to dddd73e, on Mar 20, 2018

myui commented Mar 27, 2018

select 
  NAMED_STRUCT("Name", "John", "age", 31),
  to_json(
     NAMED_STRUCT("Name", "John", "age", 31)
  ),
  to_json(
     NAMED_STRUCT("Name", "John", "age", 31),
     array('Name', 'age')
  ),
  to_json(
     NAMED_STRUCT("Name", "John", "age", 31),
     array('name', 'age')
  ),
  to_json(
     NAMED_STRUCT("Name", "John", "age", 31),
     array('age')
  ),
  to_json(
     NAMED_STRUCT("Name", "John", "age", 31),
     array()
  ),
  to_json(
     null,
     array()
  ),
  to_json(
    struct("123", "456", 789, array(314,007)),
    array('ti','si','i','bi')
  ),
  to_json(
    struct("123", "456", 789, array(314,007)),
    'ti,si,i,bi'
  ),
  to_json(
    struct("123", "456", 789, array(314,007))
  ),
  to_json(
    NAMED_STRUCT("country", "japan", "city", "tokyo")
  ),
  to_json(
    NAMED_STRUCT("country", "japan", "city", "tokyo"), 
    array('city')
  ),
  to_json(
    ARRAY(
      NAMED_STRUCT("country", "japan", "city", "tokyo"), 
      NAMED_STRUCT("country", "japan", "city", "osaka")
    )
  ),
  to_json(
    ARRAY(
      NAMED_STRUCT("country", "japan", "city", "tokyo"), 
      NAMED_STRUCT("country", "japan", "city", "osaka")
    ),
    array('city')
  );

{"name":"John","age":31} {"name":"John","age":31} {"Name":"John","age":31} {"name":"John","age":31} {"age":31} {} NULL {"ti":"123","si":"456","i":789,"bi":[314,7]} {"ti":"123","si":"456","i":789,"bi":[314,7]} {"col1":"123","col2":"456","col3":789,"col4":[314,7]} {"country":"japan","city":"tokyo"} {"city":"tokyo"} [{"country":"japan","city":"tokyo"},{"country":"japan","city":"osaka"}] [{"country":"japan","city":"tokyo"},{"country":"japan","city":"osaka"}]

select
  from_json(
    '{ "person" : { "name" : "makoto" , "age" : 37 } }',
    'struct<name:string,age:int>', 
    array('person')
  ),
  from_json(
    '[0.1,1.1,2.2]',
    'array<double>'
  ),
  from_json(to_json(
    ARRAY(
      NAMED_STRUCT("country", "japan", "city", "tokyo"), 
      NAMED_STRUCT("country", "japan", "city", "osaka")
    )
  ),'array<struct<country:string,city:string>>'),
  from_json(to_json(
    ARRAY(
      NAMED_STRUCT("country", "japan", "city", "tokyo"), 
      NAMED_STRUCT("country", "japan", "city", "osaka")
    ),
    array('city')
  ), 'array<struct<country:string,city:string>>'),
  from_json(to_json(
    ARRAY(
      NAMED_STRUCT("country", "japan", "city", "tokyo"), 
      NAMED_STRUCT("country", "japan", "city", "osaka")
    )
  ),'array<struct<city:string>>');

{"name":"makoto","age":37} [0.1,1.1,2.2] [{"country":"japan","city":"tokyo"},{"country":"japan","city":"osaka"}] [{"country":"japan","city":"tokyo"},{"country":"japan","city":"osaka"}] [{"city":"tokyo"},{"city":"osaka"}]
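
In the last from_json call above, the same JSON is read back with a narrower schema and only `city` survives. A rough Python analogue of that round trip, using the standard `json` module, might look like the sketch below. It is not Hivemall's implementation, and `project` is a hypothetical helper introduced only for illustration.

```python
import json


def project(records, fields):
    """Keep only the listed keys of each record (hypothetical helper)."""
    return [{k: r[k] for k in fields if k in r} for r in records]


cities = [
    {"country": "japan", "city": "tokyo"},
    {"country": "japan", "city": "osaka"},
]

# Serialize compactly, similar in shape to the to_json output in the thread.
encoded = json.dumps(cities, separators=(",", ":"))

# Parse and keep only the fields named by the narrower "schema".
decoded = project(json.loads(encoded), ["city"])
```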

myui commented Apr 2, 2018

@jeromebanks FYI, the merge of Brickhouse functions is in progress in this PR.

We still need to add unit tests, improve the quality of the functions, and add documentation.

myui commented Apr 4, 2018

create temporary function moving_avg as 'hivemall.statistics.MovingAverageUDTF';

select moving_avg(x, 3) from (select explode(array(1,2,3,4,5,6,7)) as x) series;
select moving_avg(x, 3) from (select explode(array(1.0,2.0,3.0,4.0,5.0,6.0,7.0)) as x) series;
avg
1.0
1.5
2.0
3.0
4.0
5.0
6.0
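
The output above is consistent with a trailing moving average whose window grows until it holds 3 values. A minimal Python sketch of that behaviour, inferred only from the printed results and not taken from MovingAverageUDTF itself:

```python
from collections import deque


def moving_avg(values, window):
    """Trailing moving average; early rows average over fewer values."""
    if window < 1:
        raise ValueError("window must be >= 1")
    buf = deque(maxlen=window)  # oldest value drops out automatically
    result = []
    for v in values:
        buf.append(v)
        result.append(sum(buf) / len(buf))
    return result
```

With the inputs 1 through 7 and a window of 3, this reproduces the seven values listed above.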

myui commented Apr 5, 2018

@maropu Could you check, if possible, whether to_json and from_json work on Spark? These functions would be useful for SparkSQL users who work with JSON.

I'm not sure whether HCatalog is available in the Spark environment.
dd99307#diff-357e4854869b2e21c38b1b437f11095aR56

myui commented Apr 5, 2018

create temporary function conditional_emit as 'hivemall.tools.array.ConditionalEmitUDTF';

-- multiple conditions in a single scan
WITH input as (
   select array(true, false, true) as conditions, array("one", "two", "three") as features
   UNION ALL
   select array(true, true, false), array("four", "five", "six")
)
select
  conditional_emit(
     conditions, features
  )
from 
  input;
feature
one
three
four
five
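
Judging by the result above, conditional_emit outputs one row per position whose condition is true. A small Python sketch of that per-row logic follows; the length check is an assumption, since the thread does not show how mismatched arrays are handled.

```python
def conditional_emit(conditions, features):
    """Return the features whose matching condition is true."""
    if len(conditions) != len(features):
        # Assumption: mismatched lengths are treated as an error.
        raise ValueError("conditions and features must have the same length")
    return [feat for cond, feat in zip(conditions, features) if cond]


rows = [
    ([True, False, True], ["one", "two", "three"]),
    ([True, True, False], ["four", "five", "six"]),
]

# Flatten the per-row results, as a UDTF would emit them.
emitted = [f for conds, feats in rows for f in conditional_emit(conds, feats)]
```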

myui commented Apr 5, 2018

create temporary function array_slice as 'hivemall.tools.array.ArraySliceUDF';

select 
  array_slice(
   array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
   0, -- offset
   2 -- length
  ),
  array_slice(
   array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
   6, -- offset
   3 -- length
  ),
  array_slice(
   array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
   6, -- offset
   10 -- length
  ),
  array_slice(
   array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
   6 -- offset
  ),
  array_slice(
   array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
   -3 -- offset
  ),
  array_slice(
   array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
   -3, -- offset
   2 -- length
  );

["zero","one"] ["six","seven","eight"] ["six","seven","eight","nine","ten"] ["six","seven","eight","nine","ten"] ["eight","nine","ten"] ["eight","nine"]
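
The six results above suggest these rules: a negative offset counts from the end, a missing length means "to the end", and a length that overruns the array is clamped. A Python sketch under those assumptions is shown below. Negative lengths do not appear in the thread, so the sketch rejects them rather than guess.

```python
def array_slice(values, offset, length=None):
    """Slice a list by offset and optional length, as in the examples above."""
    n = len(values)
    # A negative offset is counted from the end of the array.
    start = offset + n if offset < 0 else offset
    start = max(0, min(start, n))
    if length is None:
        return list(values[start:])
    if length < 0:
        # Assumption: behaviour for negative lengths is not shown in the thread.
        raise ValueError("negative length is not modelled in this sketch")
    return list(values[start:start + length])


words = ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]
```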

myui commented Apr 6, 2018

@maropu FYI, I deprecated SubarrayUDF in 7003006; ArraySliceUDF should be used instead.

asfgit pushed a commit that referenced this pull request Apr 9, 2018

paulojblack commented Apr 9, 2018

Just a heads up: I think some of the changes here that were recently pushed to master have broken the setup described in the getting started docs.

Specifically, I'm having trouble with the changes in https://github.com/apache/incubator-hivemall/blob/master/resources/ddl/define-all.hive. After commenting out lines 409-413, it works as expected.

myui commented Apr 10, 2018

@paulojblack Thank you for the comment. I will look into it and fix master.

myui commented Apr 10, 2018

@paulojblack You need to use up-to-date DDLs, since we updated the DDLs for the subarray UDF in 7003006.

With define-all.hive from the master branch, it works without errors in my environment.

myui commented Apr 10, 2018

If you are using v0.5.0, you need to use the DDLs from v0.5.0.

The DDL links on the distribution page point to the corresponding release branches.

The installation manual could be improved, though.

paulojblack commented Apr 10, 2018

That makes sense; I figured a change like that wasn't made blindly. Consider it a heads up on the docs then!

myui commented Apr 11, 2018

@paulojblack Generally, we recommend using official ASF releases rather than the master branch.

If you do use the master branch, use the latest DDLs with caution.
We'll try to note such changes in the release notes, though.

myui changed the title from "[WIP] Merge Brickhouse functions" to "[WIP][HIVEMALL-145] Merge Brickhouse functions" on Apr 12, 2018

myui commented Apr 20, 2018

select generate_series(2,4);

value
2
3
4

select generate_series(5,1,-2);

value
5
3
1

select generate_series(4,3);

(no return)

select date_add(current_date(),value) as `date`,value from (select generate_series(1,3)) t;

date    value
2018-04-21      1
2018-04-22      2
2018-04-23      3

WITH input as (
 select 1 as c1, 10 as c2, 3 as step
 UNION ALL
 select 10, 2, -3
)
select generate_series(c1, c2, step) as series from input;

series
1
4
7
10
10
7
4
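
The examples above are consistent with an inclusive range that steps from the first argument toward the second and yields nothing when the direction does not match. A minimal Python sketch of that reading follows; rejecting a zero step is an assumption, as the thread does not cover that case.

```python
def generate_series(start, end, step=1):
    """Inclusive integer series from start to end, moving by step."""
    if step == 0:
        # Assumption: a zero step would never terminate, so reject it.
        raise ValueError("step must not be zero")
    series = []
    value = start
    # Walk upward for a positive step, downward for a negative one.
    while (step > 0 and value <= end) or (step < 0 and value >= end):
        series.append(value)
        value += step
    return series
```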

myui commented Apr 24, 2018

create temporary function merge_maps as 'hivemall.tools.map.MergeMapsUDAF';

create table test as 
 SELECT map('A',10,'B',20,'C',30) as m
 UNION ALL 
 SELECT map('A',11,'D',40,'E',50) as m;

SELECT merge_maps(m) FROM test;

> {"A":11,"B":20,"C":30,"D":40,"E":50}
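
In the merge_maps result in this comment, key A carries the value from the second map (11, not 10), which looks like "later value wins". A Python sketch of that reading is below. This is an assumption drawn from the single example; in a real aggregate, which value wins for a duplicate key may depend on the order in which rows arrive.

```python
def merge_maps(maps):
    """Merge dictionaries in order; later values overwrite earlier ones."""
    merged = {}
    for m in maps:
        merged.update(m)
    return merged


rows = [
    {"A": 10, "B": 20, "C": 30},
    {"A": 11, "D": 40, "E": 50},
]
```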
myui force-pushed the myui:merge_brickhouse branch from 6f5a620 to 607cc4f on Apr 27, 2018
myui added 7 commits Mar 20, 2018
…t, last_element)
myui added 8 commits May 24, 2018
@UDFType(deterministic = true, stateful = false)
+ "- Throws HiveException if condition is not met",
extended = "SELECT count(1) FROM stock_price WHERE assert(price > 0.0);\n"
+ "SELECT count(1) FROM stock_price WHRE assert(price > 0.0, 'price MUST be more than 0.0')")

takuti May 31, 2018

Member

typo s/WHRE/WHERE/

myui added 10 commits May 31, 2018

myui commented Jun 5, 2018

@jeromebanks I'm considering merging this PR. Could you review it if possible?

jeromebanks commented Jun 5, 2018

myui commented Jun 6, 2018

For the K-Minimum Values (KMV) and sketch-related code, I'll create a separate JIRA ticket.
https://issues.apache.org/jira/browse/HIVEMALL-206

For other UDFs, we accept incoming PRs.
https://docs.google.com/spreadsheets/d/1gtFNcTvPR9OZAsbobj2D9d37tOx4nAoSlib9CLdEDQg/edit#gid=0

myui added 4 commits Jun 6, 2018

myui commented Jun 6, 2018

I'm going to merge this PR into master. If you find any problems, please comment here.

myui changed the title from "[WIP][HIVEMALL-145] Merge Brickhouse functions" to "[HIVEMALL-145] Merge Brickhouse functions" on Jun 6, 2018
asfgit closed this in 4949603 on Jun 6, 2018