write array from pig #177

Closed
oalam opened this Issue Mar 25, 2014 · 12 comments

@oalam

commented Mar 25, 2014

How can I write data containing an array of strings from Pig to ES? For example:

{
"name":"toto",
"tags": ["tag1", "tag2"]
}

I've tried with bags and tuples, but it always ends up with schema names inside the array.

@aortez

commented Apr 15, 2014

I am also having problems writing a string array from Pig to ES. The current es-hadoop documentation (http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/pig.html) states that a Pig bag maps to an ES array, but I am not getting the results I expect.

Here is a full example:

-- Grunt session (Pig Latin); lines starting with "--" after a command show its output
data = LOAD 'test_data' USING PigStorage(',') AS (f1: chararray, f2: chararray, f3: chararray);
data2 = FOREACH data GENERATE TOBAG(f1, f2, f3) AS my_fields;
STORE data2 INTO 'dgb-1610/test' USING EsStorage();

DUMP data2;
-- ({(A),(B),(C)})

DESCRIBE data2;
-- data2: {my_fields: {(chararray)}}
# this is what the resulting index looks like:
curl -XGET 'http://gblb-es01.dev.g2:9200/dgb-1610/test/_mapping'
{
    "test": {
        "properties": {
            "my_fields": {
                "properties": {
                    "0": {
                        "type": "string"
                    }
                }
            }
        }
    }
}

# but this is how I expect it to be:
{
    "test": {
        "properties": {
            "my_fields": {
                "type": "string"
            }
        }
    }
}

@costin added the bug label and removed the rest label on Apr 16, 2014

@costin

Member

commented Apr 16, 2014

Hi guys,

Sorry it took a while to get to this. The problem is caused by the fact that bags themselves contain tuples, and each tuple can (and does) contain a schema indicating its field names and their types.
The mapping above is the result of handling tuples both with and without a schema in a consistent manner.

However, I can see how this might create issues when a basic array is needed, so I'll try to come up with a fix. Note that writing the tuple in 'simple' format is easy; reading it back is not (since the tuple type needs to be figured out).

@aortez

commented Apr 16, 2014

Great! Thanks for looking into the issue Costin!

@costin

Member

commented Apr 17, 2014

Guys, I've pushed a draft update in the nightly builds - can you please try it out and report back?
I'd like to run more tests to make sure it's solid, but so far the relevant tests are passing.

@costin

Member

commented Apr 17, 2014

P.S. The uploaded Maven artifact is named something like elasticsearch-hadoop-1.3.0.BUILD-20140417.223030-390.jar

@aortez

commented Apr 18, 2014

Hi Costin. I tried the snapshot build you specified and the results are better, but still not quite right.

Using the same example I posted above:

-- 'test_data' contains the single line: A,B,C
data = LOAD 'test_data' USING PigStorage(',') AS (f1: chararray, f2: chararray, f3: chararray);
data2 = FOREACH data GENERATE TOBAG(f1, f2, f3) AS my_fields;
STORE data2 INTO 'dgb-1610/test' USING EsStorage();

DUMP data2;
-- ({(A),(B),(C)})

The better part is that the mapping now looks correct:

$ curl -XGET 'http://gblb-es01.dev.g2:9200/dgb-1610/test/_mapping' | python -mjson.tool
{
    "test": {
        "properties": {
            "my_fields": {
                "type": "string"
            }
        }
    }
}

The not-quite-right part shows up when we look at the data though:

curl -XGET 'http://gblb-es01.dev.g2:9200/dgb-1610/test/_search' | python -mjson.tool
...
                "_source": {
                    "my_fields": [
                        [
                            "A"
                        ], 
                        [
                            "B"
                        ], 
                        [
                            "C"
                        ]
                    ]
                }, 
...

But I expect it to look like so:

...
                "_source": {
                    "my_fields": [
                         "A",
                         "B",
                         "C"
                    ]
                }, 
...

BTW, to pull down that build I had to change the repository URL in my pom from
http://oss.sonatype.org/content/repositories/releases
(the URL specified in the Development Builds section of the Installation page)
to
http://oss.sonatype.org/content/repositories/snapshots
I might be doing something wrong here...

Thanks Costin!

@costin

Member

commented Apr 18, 2014

({(A),(B),(C)}) is a bag of tuples. A tuple can have one or multiple elements and is an ordered list of values, hence its representation as a JSON array.
So (A) becomes [A], (B) -> [B] and so on. One could argue that the array is not needed for tuples with one element, but consider what would happen if it were dropped:
(A) -> A
(A,B) -> [A, B]
If we nest the tuple as per your example:
({(A)}) -> [A] and ({(A), (B)}) -> [A, B]

Notice the JSON representation is the same between a tuple with two values and a bag with two tuples.
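
The ambiguity described above can be illustrated with a small Python model. This is only a sketch and not es-hadoop's actual serializer: Pig tuples are modeled as Python tuples, bags as lists of tuples, and the function names `to_json_strict` and `to_json_unwrapped` are hypothetical.

```python
import json

def to_json_strict(value):
    """Serialize tuples and bags alike as JSON arrays (every tuple keeps its array)."""
    if isinstance(value, (tuple, list)):
        return [to_json_strict(v) for v in value]
    return value

def to_json_unwrapped(value):
    """Hypothetical variant: drop the array around one-element tuples."""
    if isinstance(value, tuple):
        if len(value) == 1:
            return to_json_unwrapped(value[0])
        return [to_json_unwrapped(v) for v in value]
    if isinstance(value, list):
        return [to_json_unwrapped(v) for v in value]
    return value

tuple_of_two = ("A", "B")        # models the Pig tuple (A,B)
bag_of_two = [("A",), ("B",)]    # models the Pig bag {(A),(B)}

strict_tuple = json.dumps(to_json_strict(tuple_of_two))
strict_bag = json.dumps(to_json_strict(bag_of_two))
loose_tuple = json.dumps(to_json_unwrapped(tuple_of_two))
loose_bag = json.dumps(to_json_unwrapped(bag_of_two))

print(strict_tuple, strict_bag)  # the two structures stay distinguishable
print(loose_tuple, loose_bag)    # the two structures collapse to the same JSON
```

With the strict variant the tuple and the bag produce different JSON, so the original Pig type could in principle be recovered when reading; with the unwrapped variant both produce the same array and that information is lost.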

@aortez

commented Apr 18, 2014

Ok, thanks for the explanation Costin.

It looks like my expectations were incorrect... and it sounds like maybe I will not be able to load data in the exact structure I was hoping for - it will have to be an array with each element in its own nested array, e.g.: "my_fields": [ ["A"], ["B"] ], as opposed to "my_fields": [ "A", "B" ]. Right?

@costin

Member

commented Apr 18, 2014

You can get the JSON structure you need, but not with a bag. The crux of the problem is that Pig uses the tuple as its 'atom' and provides other complex data structures on top. And since a tuple can (and will) have multiple entries, an array (which ES can handle just fine) needs to be the basic mapping 'atom'.
If you use es-hadoop to read/write data to ES, the structure shouldn't matter in the end.
However, if you want to share the JSON with somebody else and need only a plain list, try getting rid of the bag and simply writing a tuple: rather than writing ({(A),(B),(C)}) (a bag of tuples), write (A,B,C), a basic tuple.
You can achieve this by 'flattening' the bag - see the Pig manual for more information.
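
As a rough illustration of why flattening helps, the following Python sketch models a bag as a list of one-element tuples and collapses it into a single tuple before serialization. The helpers `flatten_bag` and `to_json` are hypothetical and only mimic the effect on the JSON shape; in Pig itself the equivalent step would be done with Pig's own operators, as the manual describes.

```python
import json

def flatten_bag(bag):
    """Collapse a bag (list of tuples) into one flat tuple of all their fields."""
    return tuple(field for tup in bag for field in tup)

def to_json(value):
    """Model: tuples and bags both become JSON arrays."""
    if isinstance(value, (tuple, list)):
        return [to_json(v) for v in value]
    return value

bag = [("A",), ("B",), ("C",)]   # models the Pig bag {(A),(B),(C)}
flat = flatten_bag(bag)          # models the Pig tuple (A,B,C)

nested_doc = json.dumps({"my_fields": to_json(bag)})
flat_doc = json.dumps({"my_fields": to_json(flat)})

print(nested_doc)  # one nested array per tuple in the bag
print(flat_doc)    # a plain list of strings
```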

costin added a commit that referenced this issue Apr 18, 2014

Improve handling of Pig Tuples
change default serialization of tuples to hide/ignore their names. this
results in tuples being pure arrays/lists vs maps (name : list of values)
relates #177
@aortez

commented Apr 21, 2014

Hey Costin. With the snapshot build you specified, I am able to do as you suggest and get an array without field names, but some other behavior has also changed compared with the M2 release: it seems that it is no longer possible for any nested tuple to be named.

I think the following describes the behavior I am seeing:

  1. if a tuple is at the root level, then it is named
  2. if a tuple is at any other level, it is not named

Here is an example demonstrating this behavior:

-- data = A,B,C
data = LOAD 'test_data' USING PigStorage (',') AS (f1: chararray, f2: chararray, f3: chararray);
data2 = FOREACH data GENERATE f1, TOTUPLE(TOTUPLE(f1, f2), TOTUPLE(f3)) AS kitty: tuple(names, f3);
STORE data2 INTO 'dgb-1611/testSNAPSHOT_3' USING EsStorage();

Here is the behavior of the snapshot build (elasticsearch-hadoop-1.3.0.BUILD-20140417.223030-390.jar). As we can see, the names and f3 fields are not named, but the f1 and kitty fields are:

$ curl -XGET 'http://...:9200/dgb-1610/testSNAPSHOT/_search' | python -mjson.tool
...
                "_source": {
                    "f1": "A", 
                    "kitty": [
                        [
                            "A", 
                            "B"
                        ], 
                        "C"
                    ]
                }, 
...

And if we go back to the M2 build, all of the fields are named (this is the build I was using when I originally chimed in on this ticket):

$ curl -XGET 'http://...:9200/dgb-1610/testM2/_search' | python -mjson.tool
...
                "_source": {
                    "f1": "A", 
                    "kitty": {
                        "f3": "C", 
                        "names": {
                            "f1": "A", 
                            "f2": "B"
                        }
                    }
                }, 
...

I am trying to create something like this:

                "_source": {
                    "f1": "A", 
                    "kitty": {
                        names: [
                            "A", 
                            "B"
                        ], 
                        f3: "C"
                    }
                }, 

I should clarify my use case. I am replacing an old Java-based ETL with a Pig-based one, and I am trying to exactly replicate the structure of the index it created. And of course, thank you for your time!

@costin

Member

commented Apr 22, 2014

The root tuple that you refer to is the actual row/entry in Pig that is mapped to a JSON document. That's why it needs to use names, since otherwise its JSON representation would be invalid.
If you apply describe to data2, it will probably look something like this:

data2 = FOREACH data GENERATE f1, TOTUPLE(TOTUPLE(f1, f2), TOTUPLE(f3)) AS kitty: tuple(names, f3);
STORE data2 INTO 'dgb-1611/testSNAPSHOT_3' USING EsStorage();
f1:chararray, kitty t(t:(chararray, chararray), t:(chararray))

Setting your mapping aside: within the same structure, the tool would have to use names for some of your nested tuples but lists for others. What would the criterion be? Furthermore, how would it know how to deserialize the same JSON back into Pig?

If you want a dedicated mapping, try working your way backwards: instead of using named tuples, use maps. Keep f1 as a chararray and define kitty as a map with one key, "names", holding a tuple and another key holding a chararray.

You can still use tuples, but there are two ways of dealing with them: without names, in which case they are written as an array of primitives, or with names, in which case they are converted into an array of maps (each entry is converted to a map of field name to tuple entry).
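
To make the map suggestion concrete, here is a Python sketch of the target document, with Pig maps modeled as dicts and unnamed tuples as Python tuples. It only models the resulting JSON shape; it is not es-hadoop code, and the `serialize` helper is hypothetical.

```python
import json

def serialize(value):
    """Model: maps become JSON objects, unnamed tuples become JSON arrays."""
    if isinstance(value, dict):
        return {key: serialize(v) for key, v in value.items()}
    if isinstance(value, (tuple, list)):
        return [serialize(v) for v in value]
    return value

# Row modeled on the example in this thread: f1 stays a chararray, kitty is a
# map whose "names" key holds an unnamed tuple and whose "f3" key holds a chararray.
row = {
    "f1": "A",
    "kitty": {"names": ("A", "B"), "f3": "C"},
}

document = json.dumps(serialize(row), sort_keys=True)
print(document)
```

Because the keys come from the map rather than from a tuple schema, the nested names survive in the JSON while the inner tuple is still written as a plain array.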

Hope this helps,

@costin closed this in 18030db on May 8, 2014

@oalam

Author

commented May 9, 2014

thanks costin
