// org.apache.spark.sql.streaming.DataStreamWriter#start (Spark 2.3)
def start(): StreamingQuery = {
  ...
  if (source == "memory") {
    ...
  } else if (source == "foreach") {
    ...
  } else {
    val ds = DataSource.lookupDataSource(source, df.sparkSession.sessionState.conf)
    val disabledSources = df.sparkSession.sqlContext.conf.disabledV2StreamingWriters.split(",")
    val sink = ds.newInstance() match {
      case w: StreamWriteSupport if !disabledSources.contains(w.getClass.getCanonicalName) => w
      case _ =>
        val ds = DataSource(
          df.sparkSession,
          className = source,
          options = extraOptions.toMap,
          partitionColumns = normalizedParCols.getOrElse(Nil))
        ds.createSink(outputMode)
    }
  ...
Inside it there is a lookupDataSource method:
      case sources =>
        // There are multiple registered aliases for the input. If there is single datasource
        // that has "org.apache.spark" package in the prefix, we use it considering it is an
        // internal datasource within Spark.
        val sourceNames = sources.map(_.getClass.getName)
        val internalSources = sources.filter(_.getClass.getName.startsWith("org.apache.spark"))
        if (internalSources.size == 1) {
          logWarning(s"Multiple sources found for $provider1 (${sourceNames.mkString(", ")}), " +
            s"defaulting to the internal datasource (${internalSources.head.getClass.getName}).")
          internalSources.head.getClass
        } else {
          throw new AnalysisException(s"Multiple sources found for $provider1 " +
            s"(${sourceNames.mkString(", ")}), please specify the fully qualified class name.")
        }
Background:
On a data platform team, once requirements start pouring in, writing custom code for each one stops being realistic. At that point you generally have to collaborate with the business teams, and ideally you offer a visual development UI backed by a service layer, so that all of a user's data-processing logic is expressed through configuration files or plain SQL. Users then need a way to see live results in order to verify that their code or configuration is correct.
Existing approaches and their shortcomings:
Other blog posts describe fetching the current Spark logs over a WebSocket, including Spark's runtime logs; by switching the streaming sink to console mode, part of the streaming output can then be captured from those logs.
The drawback is that what console mode emits is really just the result of dataframe.show(), which makes for a poor interactive visualization on the front end. So: can we grab a sample of the data at runtime and send it back in something like JSON instead?
Source code
Let's first look at how the console sink works, starting from the start method (quoted at the top of this post).
It contains a lookupDataSource method (also quoted above).
So when aliases collide, the class name has to start with org.apache.spark to win. Following the console sink's approach, let's implement our own Debug sink.
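A minimal sketch of such a sink against the Spark 2.3-era Sink/StreamSinkProvider APIs (the same vintage as the DataStreamWriter code quoted above). The class names, the "debug" short name, and the numRows option are all my own illustrative choices, not anything Spark provides; the collect-then-rebuild trick in addBatch mirrors what Spark's own ConsoleSink does, because the incoming streaming DataFrame cannot simply be re-queried.

```scala
// Hypothetical debug sink, sketched for the Spark 2.3 streaming sink API.
// The package sits under org.apache.spark so that, if the alias ever
// collides with another registered source, lookupDataSource's
// "internal datasource" branch resolves in our favor.
package org.apache.spark.sql.debug

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode

class DebugSinkProvider extends StreamSinkProvider with DataSourceRegister {
  // Lets users write .format("debug") instead of the full class name.
  override def shortName(): String = "debug"

  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new DebugSink(parameters)
}

class DebugSink(options: Map[String, String]) extends Sink {
  // "numRows" is an assumed option name: how many rows to sample per batch.
  private val numRows = options.getOrElse("numRows", "10").toInt

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Like ConsoleSink, collect first and rebuild a local DataFrame;
    // the streaming DataFrame handed to addBatch must not be re-executed.
    val spark = data.sparkSession
    val local = spark.createDataFrame(
      spark.sparkContext.parallelize(data.collect().take(numRows)), data.schema)
    // JSON rows instead of show()'s plain-text table.
    val json = local.toJSON.collect()
    // Placeholder: this is where you would push to Redis or Kafka.
    json.foreach(println)
  }
}
```

For the shortName alias to be discoverable, the provider also has to be listed in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister on the classpath; otherwise only the fully qualified class name works.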
A checkpoint location must be specified explicitly, because in the source you can see
useTempCheckpointLocation = source == "console",
meaning a temporary checkpoint is only granted to the console sink; a random directory under /tmp is good enough. With that in place, a JSON-formatted sample of the DataFrame gets printed for debugging, and we can push it to Redis or Kafka for front-end charting, making it easy for users to debug, inspect column types, and so on.
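Putting the pieces together, a usage sketch (the format alias, option name, and path here follow the hypothetical sink above and are not Spark built-ins):

```scala
import java.util.UUID

val query = df.writeStream
  .format("org.apache.spark.sql.debug.DebugSinkProvider") // or "debug" if registered
  .option("numRows", "20")
  // Mandatory: useTempCheckpointLocation is only true for source == "console",
  // so every other sink must supply its own checkpoint directory.
  .option("checkpointLocation", s"/tmp/debug-sink-${UUID.randomUUID()}")
  .outputMode("append")
  .start()
```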