MongoDB JSON Import - not able to group by column with array value #869
Clearly at least one of your rows has more than one value in the `client_mac` column. While crate is very strict with single values and arrays (string vs. array(string)) on inserts and updates, we made a tradeoff between performance and accuracy, towards performance, when it comes to importing data. The usual field validation does not take place when using `COPY FROM`. So you have to make sure that the `client_mac` field in your mongodb collection only contains a single value, not an array.
We have v. small mongo machines collecting the data and not much control over what's imported. Having said that, it's the same sources, so the macs shouldn't be anything other than strings. It took 5 mins to export, a couple to gzip, and I cancelled the job after 5 minutes of importing, at about 250,000 rows (on a crappy dev. machine, to test). The problem is that scrubbing the data every day (around 20 million rows each day) would be too time consuming. But I might take a performance hit upon importing the data - especially as we might want to clean other fields first. We had problems with ES and mac addresses previously because of the colons - we even tried replacing them with hyphens, no success. Mac addresses are fun...
I think the problem with ES and mac addresses is that it analyzes the column by default, stripping numbers and colons and stuff. crate by default analyzes its columns with the keyword analyzer, interpreting the whole content as one term. So colons, and hence mac addresses, shouldn't be a problem with crate. Can you provide your table schema? Looking forward to getting your mongo data correctly into crate! :)
Yeah, us too! Looks excellent so far though. That sounds about right - for our production application we've had to create a multi-field mapping for the macs, one of them unindexed, as it breaks stuff otherwise. Sample JSON here: https://gist.github.com/simonmorley/8dd81b1456d1d366dcff The create command would have been something like this:
Looking at it, I wonder if I could create it with ap_mac as a non-indexed string?
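The original create statement didn't survive in this copy of the thread, but given the later discussion about fulltext indexes causing the problem, it plausibly looked something like this. This is a reconstruction, not the actual command: the column list is borrowed from the corrected schema further down, and the `INDEX USING fulltext` clauses are assumptions.

```sql
-- Hypothetical reconstruction of the original schema; the INDEX USING fulltext
-- clauses are assumptions, based on the later advice to drop them.
create table streams (
  created_at timestamp,
  ap_mac string INDEX USING fulltext,
  rssi integer,
  client_mac string INDEX USING fulltext,
  secret string,
  id string primary key
);
```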
I've deleted the table and recreated it with no index on that field, and am just waiting for the rows to import. I don't want to kill the job in case I truncate a row prematurely, so I'll sit it out. On a virtual machine with 4GB RAM and an SSD, it's taken 15+ minutes to copy 1 million rows. I can't test the query while it's importing though - it waits and waits, understandably.
ah, i can see clearly now, the rain has gone! We actually didn't make that as clear as it should be in the documentation. You should not use a fulltext index on your mac address fields, as you shouldn't need to do full-text search on them. Here is a table schema that should work for you:

```sql
create table streams (
  created_at timestamp,
  ap_mac string,
  rssi integer,
  client_mac string,
  secret string,
  id string primary key
) with (number_of_replicas=0);
```

Using the fulltext index will use the standard analyzer from elasticsearch - so you would copy the same bad behaviour for mac addresses from elasticsearch. :( One more piece of advice: you should create your table with `number_of_replicas=0` while importing, as in the schema above - it speeds up the import considerably.
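The import and post-import steps discussed here might look like the following sketch. The file path and replica count are placeholders, and it assumes Crate's `COPY FROM` statement is what's used to load the JSON dump:

```sql
-- Hypothetical import of the Mongo JSON dump; the file path is a placeholder.
COPY streams FROM 'file:///tmp/streams-dump.json';

-- Once the import has finished, replicas can be turned back on.
-- The replica count here is an example, not from the thread.
ALTER TABLE streams SET (number_of_replicas = 1);
```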
haha. I can't see clearly, it's too sunny here! Ok, that makes good sense. The docs are good but obvs. missing the backing of a load of community support (yet). That's how I just imported it :) 4 million rows down. Thanks for the tip re. replicas. I'm not sure how that's going to fare importing this much every day. Do you have any guidance for working with big data sets? Whilst I still don't really consider what we're doing 'big data' (and I hate that term), I think we're going to be around 6-8TB in a year.
The documentation is updated in master, see #874. Can you close the issue if the new schema works for you? Let's discuss all the other topics via IRC. :)
Deal, see you there :) Import worked fine, thanks.
I've imported directly from a JSON Mongo dump. I'm trying to group by a column (type string) which happens to be a mac address.
There are 20 million+ rows of data, and when I try to group by the mac, I get an error:
The query:
I was just on IRC, and the suggestion was that there's a rogue mac which may not be a string - however, with so many rows, I can't scrub the data before importing.
Any suggestions?
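The query itself isn't preserved in this copy of the issue, but from the discussion it was a simple GROUP BY on the mac column, along these lines. The table and column names are taken from the schema that appears later in the thread; the exact query is an assumption:

```sql
-- Hypothetical reconstruction of the failing query; names are assumptions.
SELECT client_mac, count(*) AS hits
FROM streams
GROUP BY client_mac
ORDER BY hits DESC;
```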