Conversation
@@ -45,7 +45,7 @@
 df.filter(df.airportName >= 'Moscow').select("_id",'airportName').show()
 df.filter(df._id >= 'CAA').select("_id",'airportName').show()
-df.filter(df._id >= 'CAA').select("_id",'airportName').save("airportcodemapping_df",
+df.filter(df._id >= 'CAA').select("_id",'airportName').write.save("airportcodemapping_df",
     "com.cloudant.spark", bulkSize = "100")
This statement did not work for me to demonstrate the new behavior. I had to modify it to:
df.filter(df._id >= 'ZZZ').select("_id",'airportName').write.save("airportcodemapping_df", "com.cloudant.spark", bulkSize = "100")
in order to get:
[WARN] [06/02/2016 17:32:50.954] [Thread-3] [CloudantReadWriteRelation(akka://CloudantSpark-a3f15ff1-f340-43eb-a7c5-891fb21f45bb)] Database airportcodemapping_df: nothing was saved because the number of records was 0!
The original query with >= 'CAA' would still give me the good INFO message, like:
[INFO] [06/02/2016 17:27:43.275] [Executor task launch worker-0] [JsonStoreDataAccess(akka://CloudantSpark-4a0b1607-c66d-491f-b8d8-e6d4cad61cf1)] Save total 13 with bulkSize 100 in 0s
Which behavior did you want to demonstrate here: a successful save, or a failure with a warning?
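The two outcomes above can be sketched with a small plain-Python model (hypothetical; this is not the connector's actual source, and the function name `save` is an illustration only): an empty result set takes the WARN path, a non-empty one is split into bulks and written.

```python
# Hypothetical plain-Python model (not the connector's source) of the
# two outcomes seen in the logs above: 0 matching records -> WARN,
# N records -> bulk save with an INFO summary.
def save(records, bulk_size=100):
    """Return the number of documents saved, warning when there is
    nothing to write (mirrors the WARN/INFO log lines above)."""
    if not records:
        print("WARN: nothing was saved because the number of records was 0!")
        return 0
    # split into bulks of at most bulk_size documents
    bulks = [records[i:i + bulk_size] for i in range(0, len(records), bulk_size)]
    for bulk in bulks:
        pass  # a real connector would POST each bulk to the database here
    print("INFO: Save total %d with bulkSize %d" % (len(records), bulk_size))
    return len(records)

save([])               # the >= 'ZZZ' filter: 0 records, WARN path
save(list(range(13)))  # the >= 'CAA' filter: 13 records, INFO path
```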
@HolgerKache Holger, this patch is intended for the case when the number of partitions is greater than the number of documents to be saved. In your case the database is very small and uses just one partition, so the save function works fine.
An example to demonstrate how the patch works:
conf.set("jsonstore.rdd.partitions",20) #using 20 partitions
df = sqlContext.load("n_flight", "com.cloudant.spark") # big enough database to use all 20 partitions
df2 = df.filter(df.flightSegmentId=='AA106').select("flightSegmentId", "economyClassBaseCost") #df2 contains only 5 docs
#this will throw an error without the patch as not every partition will have at least 1 doc to save
df2.write.save("n_flight111", "com.cloudant.spark",
bulkSize = "100", createDBOnSave="true")
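To see why 5 documents spread over 20 partitions trip the old error, here is a plain-Python sketch (hypothetical; the round-robin split is an assumption for illustration, not the connector's actual partitioner):

```python
def partition_counts(num_docs, num_partitions):
    """Distribute num_docs over num_partitions round-robin and return
    the per-partition record counts."""
    counts = [0] * num_partitions
    for i in range(num_docs):
        counts[i % num_partitions] += 1
    return counts

counts = partition_counts(5, 20)       # 5 matching docs, 20 partitions
empty = sum(1 for c in counts if c == 0)
print(empty)  # 15 of the 20 partitions have nothing to save; before the
              # patch each of them raised an error, aborting the write
```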
Ok, I understand. My tests use a factor of 10 only for the load, and as a result there are too few airport codes to work with: python -m helpers.dataload -load 10
I like the example you provide above based on flights (mostly because it has more comments). Want to deliver that instead of airportcodemapping?
@mayya-sharipova The new code works well and looks good. Only the samples may need some improvements. E.g., examples/python/CloudantDF.py depends on a database I don't have. Where is that database supposed to come from?
Another problem I have is with examples/scala/src/main/scala/mytest/spark/CloudantDF.scala. Here we set [...]. When I have a database: a) it throws an expected exception, b) execution hangs and does not return - unexpected. This ^ is probably unrelated to the code changes in this ticket, but a problem nonetheless.
@HolgerKache About your last two comments:
Force-pushed from d3669a6 to 45a65db:
Don't raise an error if any partition has 0 records
Update an example with save operation reflecting new API
Solve an issue with an application freezing on error
BugzID: 67396
Force-pushed from 45a65db to 5e194ea:
Don't raise an error if any partition has 0 records
Update an example with save operation reflecting new API
BugzID: 67396