SQL query on ADAM's flattened file -> Non-local session path expected to be non-null; #141

Closed
jerryivanhoe opened this Issue Nov 12, 2015 · 4 comments

Comments

@jerryivanhoe

jerryivanhoe commented Nov 12, 2015

Hi,
I "flattened" an Adam file - storing chromosome 1 of HG 1000genome. Then I wanted to start an SQL-query on this data:

That's what I tried ....

scala> val sqlRDD2 = sqlContext.parquetFile("hdfs:///user/ec2-user/1kg/chr1.adam_flatten")
sqlRDD2: org.apache.spark.sql.DataFrame = [variant__variantErrorProbability: int, variant__contig__contigName: string, variant__contig__contigLength: bigint, variant__contig__contigMD5: string, variant__contig__referenceURL: string, variant__contig__assembly: string, variant__contig__species: string, variant__contig__referenceIndex: int, variant__start: bigint, variant__end: bigint, variant__referenceAllele: string, variant__alternateAllele: string, variant__svAllele__type: binary, variant__svAllele__assembly: string, variant__svAllele__precise: boolean, variant__svAllele__startWindow: int, variant__svAllele__endWindow: int, variant__isSomatic: boolean, variantCallingAnnotations__variantIsPassing: boolean, variantCallingAnnotations__downsampled: boolean, variantCallingAnnotations__baseQ...

scala> sqlRDD2.printSchema
root
|-- variant__variantErrorProbability: integer (nullable = true)
|-- variant__contig__contigName: string (nullable = true)
|-- variant__contig__contigLength: long (nullable = true)
|-- variant__contig__contigMD5: string (nullable = true)
|-- variant__contig__referenceURL: string (nullable = true)
|-- variant__contig__assembly: string (nullable = true)
|-- variant__contig__species: string (nullable = true)
|-- variant__contig__referenceIndex: integer (nullable = true)
|-- variant__start: long (nullable = true)
|-- variant__end: long (nullable = true)
|-- variant__referenceAllele: string (nullable = true)
|-- variant__alternateAllele: string (nullable = true)
|-- variant__svAllele__type: binary (nullable = true)
|-- variant__svAllele__assembly: string (nullable = true)
|-- variant__svAllele__precise: boolean (nullable = true)
|-- variant__svAllele__startWindow: integer (nullable = true)
|-- variant__svAllele__endWindow: integer (nullable = true)
|-- variant__isSomatic: boolean (nullable = true)
|-- variantCallingAnnotations__variantIsPassing: boolean (nullable = true)
|-- variantCallingAnnotations__downsampled: boolean (nullable = true)
|-- variantCallingAnnotations__baseQRankSum: float (nullable = true)
|-- variantCallingAnnotations__fisherStrandBiasPValue: float (nullable = true)
|-- variantCallingAnnotations__rmsMapQ: float (nullable = true)
|-- variantCallingAnnotations__mapq0Reads: integer (nullable = true)
|-- variantCallingAnnotations__mqRankSum: float (nullable = true)
|-- variantCallingAnnotations__readPositionRankSum: float (nullable = true)
|-- variantCallingAnnotations__vqslod: float (nullable = true)
|-- variantCallingAnnotations__culprit: string (nullable = true)
|-- sampleId: string (nullable = true)
|-- sampleDescription: string (nullable = true)
|-- processingDescription: string (nullable = true)
|-- expectedAlleleDosage: float (nullable = true)
|-- referenceReadDepth: integer (nullable = true)
|-- alternateReadDepth: integer (nullable = true)
|-- readDepth: integer (nullable = true)
|-- minReadDepth: integer (nullable = true)
|-- genotypeQuality: integer (nullable = true)
|-- splitFromMultiAllelic: boolean (nullable = true)
|-- isPhased: boolean (nullable = true)
|-- phaseSetId: integer (nullable = true)
|-- phaseQuality: integer (nullable = true)

scala> sqlRDD2.columns
res2: Array[String] = Array(variant__variantErrorProbability, variant__contig__contigName, variant__contig__contigLength, variant__contig__contigMD5, variant__contig__referenceURL, variant__contig__assembly, variant__contig__species, variant__contig__referenceIndex, variant__start, variant__end, variant__referenceAllele, variant__alternateAllele, variant__svAllele__type, variant__svAllele__assembly, variant__svAllele__precise, variant__svAllele__startWindow, variant__svAllele__endWindow, variant__isSomatic, variantCallingAnnotations__variantIsPassing, variantCallingAnnotations__downsampled, variantCallingAnnotations__baseQRankSum, variantCallingAnnotations__fisherStrandBiasPValue, variantCallingAnnotations__rmsMapQ, variantCallingAnnotations__mapq0Reads, variantCallingAnnotations__mqRan...
scala>

scala> sqlRDD2.registerTempTable("sqlRDD2")

So far, so good – but when I want to "select" something...

scala> val countResult = sqlContext.sql("SELECT COUNT(*) FROM sqlRDD2)").collect()
org.apache.spark.sql.AnalysisException: Non-local session path expected to be non-null;
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:260)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)

Any idea? I found some internet links mentioning "Non-local session path expected to be non-null;" but unfortunately no answer...

greetings
-Jerry

@fnothaft

fnothaft (Member) commented Nov 12, 2015

Hi @jerryivanhoe!

I'm not sure what the error is—I haven't seen that error myself—but, does the error reproduce if you use Spark SQL without flattening the Parquet file? You need to flatten the Parquet file if you are using Hive or Impala, but Spark SQL knows how to process nested schemas.
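
For reference, a minimal sketch of the unflattened route – Spark SQL reaches into nested structs with dot notation, so no flattening is needed. This is untested on your cluster, the table name is made up, and I'm assuming the unflattened file sits next to the flattened one:

scala> val df = sqlContext.parquetFile("hdfs:///user/ec2-user/1kg/chr1.adam")
scala> df.registerTempTable("genotypes")
scala> // dot notation addresses fields inside the nested variant struct
scala> sqlContext.sql("SELECT variant.contig.contigName, variant.start FROM genotypes LIMIT 5").show()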

@jerryivanhoe

jerryivanhoe commented Nov 13, 2015

Hi Frank,

Thanks for looking into this!

Yes – same problem:

scala> val sqlRDD = sqlContext.parquetFile("hdfs:///user/ec2-user/1kg/chr1.adam")
sqlRDD: org.apache.spark.sql.DataFrame = [variant: struct<variantErrorProbability:int,contig:struct<contigName:string,contigLength:bigint,contigMD5:string,referenceURL:string,assembly:string,species:string,referenceIndex:int>,start:bigint,end:bigint,referenceAllele:string,alternateAllele:string,svAllele:struct<type:binary,assembly:string,precise:boolean,startWindow:int,endWindow:int>,isSomatic:boolean>, variantCallingAnnotations: struct<variantIsPassing:boolean,variantFilters:array<string>,downsampled:boolean,baseQRankSum:float,fisherStrandBiasPValue:float,rmsMapQ:float,mapq0Reads:int,mqRankSum:float,readPositionRankSum:float,genotypePriors:array<float>,genotypePosteriors:array<float>,vqslod:float,culprit:string,attributes:map<string,string>>, sampleId: string, sampleDescription: string...
scala> sqlRDD.printSchema
root
|-- variant: struct (nullable = true)
| |-- variantErrorProbability: integer (nullable = true)
| |-- contig: struct (nullable = true)
| | |-- contigName: string (nullable = true)
| | |-- contigLength: long (nullable = true)
| | |-- contigMD5: string (nullable = true)
| | |-- referenceURL: string (nullable = true)
| | |-- assembly: string (nullable = true)
| | |-- species: string (nullable = true)
| | |-- referenceIndex: integer (nullable = true)
| |-- start: long (nullable = true)
| |-- end: long (nullable = true)
| |-- referenceAllele: string (nullable = true)
| |-- alternateAllele: string (nullable = true)
| |-- svAllele: struct (nullable = true)
| | |-- type: binary (nullable = true)
| | |-- assembly: string (nullable = true)
| | |-- precise: boolean (nullable = true)
| | |-- startWindow: integer (nullable = true)
| | |-- endWindow: integer (nullable = true)
| |-- isSomatic: boolean (nullable = true)
|-- variantCallingAnnotations: struct (nullable = true)
| |-- variantIsPassing: boolean (nullable = true)
| |-- variantFilters: array (nullable = false)
| | |-- element: string (containsNull = false)
| |-- downsampled: boolean (nullable = true)
| |-- baseQRankSum: float (nullable = true)
| |-- fisherStrandBiasPValue: float (nullable = true)
| |-- rmsMapQ: float (nullable = true)
| |-- mapq0Reads: integer (nullable = true)
| |-- mqRankSum: float (nullable = true)
| |-- readPositionRankSum: float (nullable = true)
| |-- genotypePriors: array (nullable = false)
| | |-- element: float (containsNull = false)
| |-- genotypePosteriors: array (nullable = false)
| | |-- element: float (containsNull = false)
| |-- vqslod: float (nullable = true)
| |-- culprit: string (nullable = true)
| |-- attributes: map (nullable = false)
| | |-- key: string
| | |-- value: string (valueContainsNull = false)
|-- sampleId: string (nullable = true)
|-- sampleDescription: string (nullable = true)
|-- processingDescription: string (nullable = true)
|-- alleles: array (nullable = false)
| |-- element: binary (containsNull = false)
|-- expectedAlleleDosage: float (nullable = true)
|-- referenceReadDepth: integer (nullable = true)
|-- alternateReadDepth: integer (nullable = true)
|-- readDepth: integer (nullable = true)
|-- minReadDepth: integer (nullable = true)
|-- genotypeQuality: integer (nullable = true)
|-- genotypeLikelihoods: array (nullable = false)
| |-- element: float (containsNull = false)
|-- nonReferenceLikelihoods: array (nullable = false)
| |-- element: float (containsNull = false)
|-- strandBiasComponents: array (nullable = false)
| |-- element: integer (containsNull = false)
|-- splitFromMultiAllelic: boolean (nullable = true)
|-- isPhased: boolean (nullable = true)
|-- phaseSetId: integer (nullable = true)
|-- phaseQuality: integer (nullable = true)

scala> sqlRDD.registerTempTable("sqlRDD")

scala> val countResult = sqlContext.sql("SELECT COUNT(*) FROM sqlRDD)").collect()
org.apache.spark.sql.AnalysisException: Non-local session path expected to be non-null;
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:260)

laters
Jerry
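
Side note: one way to sanity-check the data without going through the Hive SQL parser at all would be the DataFrame API directly – a minimal sketch using the sqlRDD from above (untested here):

scala> // operates on the DataFrame itself, bypassing SQL parsing entirely
scala> sqlRDD.count()
scala> sqlRDD.select("sampleId").show(5)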

@jerryivanhoe

jerryivanhoe commented Nov 18, 2015

Hi, maybe the Eggo SQL issue is related to Hive?!

When I do...

scala> val sqlRDD2 = sqlContext.parquetFile("hdfs:///user/ec2-user/1kg/chr1.adam_flatten")
scala> sqlRDD2.registerTempTable("sqlRDD2")
scala> val showResult = sqlContext.sql("show tables")

This fails with an error, too:

15/11/18 01:31:06 INFO ObjectStore: ObjectStore, initialize called
15/11/18 01:31:06 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/jars/datanucleus-api-jdo-3.2.1.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/jars/datanucleus-api-jdo-3.2.6.jar."
15/11/18 01:31:06 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/jars/datanucleus-rdbms-3.2.1.jar."
15/11/18 01:31:06 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/jars/datanucleus-core-3.2.2.jar."
15/11/18 01:31:06 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/11/18 01:31:06 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/11/18 01:31:06 WARN HiveMetaStore: Retrying creating default database after error: Error creating transactional connection factory
javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:587)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:788)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

Google-ing "Error creating transactional connection factory" points to misconfigured "hive".

but Hive itself comes up on the master node:

[ec2-user@ip-10-1-1-239 ~]$ hive

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/jars/hive-common-1.1.0-cdh5.4.8.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>

Maybe someone can send me some SQL commands that work on his/her Eggo installation?!

My project depends heavily on SQL.
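
One workaround I still want to try: a plain SQLContext instead of the HiveContext that spark-shell hands out, so temp-table queries never touch the Hive metastore. A sketch – the names are made up and this is untested on this cluster:

scala> // plain SQLContext: its own SQL parser, no Hive metastore involved
scala> val plainSql = new org.apache.spark.sql.SQLContext(sc)
scala> val df = plainSql.parquetFile("hdfs:///user/ec2-user/1kg/chr1.adam_flatten")
scala> df.registerTempTable("genotypes_flat")
scala> plainSql.sql("SELECT COUNT(*) FROM genotypes_flat").collect()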

thanks
-Jerry

@jerryivanhoe

jerryivanhoe commented Dec 1, 2015

Hi,
I set up a "fresh" cluster on AWS EMR with this software: https://repo1.maven.org/maven2/org/bdgenomics/adam/adam-distribution_2.10/0.18.1/adam-distribution_2.10-0.18.1-bin.tar.gz
The issue is gone.
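
For anyone landing here: the thing to re-test on a fresh cluster is the temp-table pattern from the earlier comments. Note also that the failing queries above had a stray ")" inside the SQL string; the cleaned-up version would be (a sketch, not re-run here):

scala> val df = sqlContext.parquetFile("hdfs:///user/ec2-user/1kg/chr1.adam_flatten")
scala> df.registerTempTable("sqlRDD2")
scala> sqlContext.sql("SELECT COUNT(*) FROM sqlRDD2").collect()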
