# Spylon Kernel Test with Spark 3.4.0

This notebook has been updated from Spark 2.4. I use a local SBT build via /misc/build/0/classes. It is similar to the PySpark spark0 notebook.

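The kernel must use the same Scala version as Spark, which is 2.13 (it was 2.11). Note that the Spark NLP package configured below is a `_2.12` artifact, so this is worth double-checking.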
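
One way to check this from inside the notebook is to compare the interpreter's Scala version with the suffix of a package coordinate. The following is a minimal sketch in plain Scala; `artifactScalaSuffix` is a hypothetical helper, not part of Spark or Spylon, and the coordinate is the one configured later in this notebook.

```scala
// Scala version of the running interpreter, reduced to major.minor (e.g. "2.12").
val scalaBinary: String =
  scala.util.Properties.versionNumberString.split('.').take(2).mkString(".")

// Hypothetical helper: pull the Scala suffix out of a "group:artifact:version" coordinate.
def artifactScalaSuffix(coordinate: String): Option[String] =
  coordinate.split(':') match {
    case Array(_, artifact, _) =>
      val idx = artifact.lastIndexOf('_')
      if (idx >= 0) Some(artifact.substring(idx + 1)) else None
    case _ => None
  }

val suffix = artifactScalaSuffix("com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.2")
println(s"interpreter: Scala $scalaBinary, package built for: ${suffix.getOrElse("unknown")}")
```

If the two values differ, binary incompatibilities at run time are a likely consequence.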

I haven't recompiled the Scala source code in src - the artikus.spark classes.

Once a Spark context is instantiated, its web UI should be accessible at http://j1:4040 if the host of this notebook is j1. The hostname is taken from `spark.driver.host`.

In [1]:
%%python
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Of no use for a Spylon notebook

## Configuration and Initialization of Spark

Note that we can set things like driver memory etc.

If `launcher._spark_home` is not set it will default to looking at the `SPARK_HOME` environment variable.

I run on a cluster owned by the hadoop user who is a member of my group devel.

I build new features for Scala and access them via /misc/build/0/classes. I have to restart the kernel to pick up any new classes, and relaunch Spark to pick up changes.

I can't change the spark.sql.warehouse.dir. There is no explicit command to enableHiveSupport and no configuration.

The configuration below loads an external package, com.johnsnowlabs.nlp, and its dependencies with Ivy. These go to .ivy2/cache.

In [2]:
%%init_spark
launcher.master = "local[*]"
launcher.conf.spark.app.name = "spark-lda"
launcher.conf.spark.executor.cores = 8
launcher.num_executors = 4
launcher.executor_cores = 4
launcher.driver_memory = '4g'
launcher.conf.set("spark.driver.cores", 4);
launcher.conf.set("spark.executor.cores", 4);
launcher.conf.set("spark.executor.memory", "4g");
launcher.conf.set("spark.executor.instances", 4);
launcher.conf.set("spark.sql.catalogImplementation", "hive");
launcher.conf.set("spark.sql.warehouse.dir", "file:///home/hadoop/data/hive");
launcher.conf.set("spark.hadoop.fs.permissions.umask-mode", "002");
launcher.conf.set("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.2");
launcher.conf.set("spark.driver.extraClassPath", ":/misc/build/0/classes/:/usr/share/java/postgresql.jar");

## Spark Configuration

Some basic operations.


In [3]:
import org.apache.spark.sql.{DataFrame, Encoder, Encoders, Row, SparkSession}
SparkSession.builder().enableHiveSupport().getOrCreate()

Intitializing Scala interpreter ...

Spark Web UI available at http://j1.host0:4040
SparkContext available as 'sc' (version = 3.4.0, master = local[*], app id = local-1684620375479)
SparkSession available as 'spark'


import org.apache.spark.sql.{DataFrame, Encoder, Encoders, Row, SparkSession}
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@58c6f79b


In [4]:
spark

res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@58c6f79b


In [5]:
spark.version

res2: String = 3.4.0


In [6]:
spark.conf.getAll foreach (x => println(x._1 + " --> " + x._2))

spark.sql.warehouse.dir --> file:/home/hadoop/data/hive
spark.hadoop.fs.permissions.umask-mode --> 002
spark.executor.extraJavaOptions --> -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false
spark.driver.host --> j1.host0
spark.serializer.objectSt

In [7]:
var dbs1 = spark.catalog.listDatabases()
dbs1.show
spark.catalog.listCatalogs().show()

+-------+-------------+--------------------+--------------------+
|   name|      catalog|         description|         locationUri|
+-------+-------------+--------------------+--------------------+
|default|spark_catalog|Default Hive data...|hdfs://k1:8020/us...|
+-------+-------------+--------------------+--------------------+

+-------------+-----------+
|         name|description|
+-------------+-----------+
|spark_catalog|       null|
+-------------+-----------+



dbs1: org.apache.spark.sql.Dataset[org.apache.spark.sql.catalog.Database] = [name: string, catalog: string ... 2 more fields]


In [8]:
dbs1.take(1)(0)

res5: org.apache.spark.sql.catalog.Database = Database[name='default', catalog='spark_catalog', description='Default Hive database', path='hdfs://k1:8020/user/hive/warehouse']


In [32]:
var df0 = spark.sql("show databases")
df0.show()
df0 = spark.sql("show tables")
df0.show(truncate=false)
df0 = spark.sql("select count(*) from finalTable")

+---------+
|namespace|
+---------+
|  default|
+---------+

+---------+----------+-----------+
|namespace|tableName |isTemporary|
+---------+----------+-----------+
|default  |finaltable|false      |
|default  |stage0    |false      |
|default  |stage1    |false      |
|default  |xusers    |false      |
+---------+----------+-----------+



df0: org.apache.spark.sql.DataFrame = [count(1): bigint]
df0: org.apache.spark.sql.DataFrame = [count(1): bigint]
df0: org.apache.spark.sql.DataFrame = [count(1): bigint]


In [10]:
val stage1 = spark.sql("select * from stage1")

stage1: org.apache.spark.sql.DataFrame = [publish_date: int, tokens: array<string> ... 1 more field]


In [31]:
spark.catalog.listTables().select("name").collect().map(_.getString(0)).filter(_.contains("_"))
 .foreach(x => spark.sql(s"drop table ${x}") )

In [35]:
df0 = spark.sql("select * from stage1")
df0.show(false)

+------------+---------------------------------------------------+---------------------------+
|publish_date|tokens                                             |cntvec_79841e49b2ea__output|
+------------+---------------------------------------------------+---------------------------+
|20030219    |[aba, decid, commun, broadcast, licenc]            |{0, 500, [], []}           |
|20030219    |[act, fire, wit, must, awar, defam]                |{0, 500, [], []}           |
|20030219    |[g, call, infrastructur, protect, summit]          |{0, 500, [], []}           |
|20030219    |[air, nz, staff, aust, strike, pai, rise]          |{0, 500, [], []}           |
|20030219    |[air, nz, strike, affect, australian, travel]      |{0, 500, [], []}           |
|20030219    |[ambiti, olsson, win, tripl, jump]                 |{0, 500, [], []}           |
|20030219    |[antic, delight, record, break, barca]             |{0, 500, [], []}           |
|20030219    |[aussi, qualifi, stosur, wast, four,

df0: org.apache.spark.sql.DataFrame = [publish_date: int, tokens: array<string> ... 1 more field]


In [48]:
df0.filter(size($"cntvec_79841e49b2ea__output.indices") =!= 0).count()

res37: Long = 1


In [12]:
stage1.printSchema()

root
 |-- publish_date: integer (nullable = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- cntvec_79841e49b2ea__output: struct (nullable = true)
 |    |-- type: byte (nullable = true)
 |    |-- size: integer (nullable = true)
 |    |-- indices: array (nullable = true)
 |    |    |-- element: integer (containsNull = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: double (containsNull = true)



In [18]:
stage1.select("features.indices")

org.apache.spark.sql.AnalysisException:  [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `features`.`indices` cannot be resolved. Did you mean one of the following? [`spark_catalog`.`default`.`stage1`.`tokens`, `spark_catalog`.`default`.`stage1`.`publish_date`, `spark_catalog`.`default`.`stage1`.`cntvec_79841e49b2ea__output`].;

In [52]:
stage1.filter(size(col("features.indices")) > 1).collect().size

res38: Int = 0


## Using local Scala Builds

In [2]:
import artikus.spark.U

Intitializing Scala interpreter ...

Spark Web UI available at http://j1.host0:4040
SparkContext available as 'sc' (version = 3.4.0, master = local[*], app id = local-1684091628830)
SparkSession available as 'spark'


import artikus.spark.U


In [3]:
val cl = spark.getClass().getClassLoader()
cl.asInstanceOf[java.net.URLClassLoader].getURLs.map(x => x.toString())

java.lang.ClassCastException:  class jdk.internal.loader.ClassLoaders$AppClassLoader cannot be cast to class java.net.URLClassLoader (jdk.internal.loader.ClassLoaders$AppClassLoader and java.net.URLClassLoader are in module java.base of loader 'bootstrap')
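
The cast fails because, from Java 9 onwards, the application class loader is no longer a `java.net.URLClassLoader`. A sketch of two alternatives that avoid the exception: read the `java.class.path` system property, or pattern match on the loader instead of casting. `loaderUrls` is a hypothetical helper name.

```scala
import java.io.File
import java.net.URLClassLoader

// Class path entries the JVM was started with.
val classPathEntries: Seq[String] =
  System.getProperty("java.class.path", "")
    .split(File.pathSeparator)
    .filter(_.nonEmpty)
    .toSeq

classPathEntries.foreach(println)

// Hypothetical helper: list URLs only when the loader really is a URLClassLoader.
def loaderUrls(cl: ClassLoader): Seq[String] = cl match {
  case u: URLClassLoader => u.getURLs.map(_.toString).toSeq
  case _                 => Seq.empty
}
```

In the notebook, `loaderUrls(spark.getClass.getClassLoader)` should then return an empty sequence rather than throw.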

In [4]:
// These are from the /misc/build/0/classes
U.identity
U.printClass(spark)
U.alert("hello")

class org.apache.spark.sql.SparkSession
hello


In [None]:
U.classes(spark)

In [None]:
U.flist(".")

## SparkSession operations

Basic operations
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SparkSession.html#createDataset

In [None]:
val strings = spark.emptyDataset[String]
strings.printSchema

In [None]:
val one = spark.createDataset(Seq(1))
one.show
one.printSchema

In [None]:
// Using the implicits requires a value named "spark" in scope.
import spark.implicits._

val one = Seq(1).toDS
one.show
one.printSchema

In [None]:
// Using spark.range()
val range0 = spark.range(start = 0, end = 4, step = 2, numPartitions = 5)
range0.show

In [None]:
// More packing

In [None]:
val sc = spark.sparkContext

In [None]:
val data = Seq("a", "b", "c", "d") zip (0 to 4)

U.printClass(data)

In [None]:
val data = Seq("foo", "bar", "baz") zip 1 :: 2 :: 3 :: Nil
val data1 = Seq("foo", "bar", "bar") zip 4 :: 5 :: 6 :: Nil
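
`zip` pairs elements by position and stops at the shorter collection, so zipping four letters with `0 to 4` (five numbers) drops the last number. In the cell above, `::` binds more tightly than the alphanumeric operator `zip`, so the right-hand list is built first. A small sketch in plain Scala, with no Spark needed:

```scala
// Four letters zipped with five numbers: the unmatched 4 is dropped.
val letters = Seq("a", "b", "c", "d") zip (0 to 4)

// Infix `zip` has lower precedence than `::`, so these two are equivalent.
val implicitParens = Seq("foo", "bar", "baz") zip 1 :: 2 :: 3 :: Nil
val explicitParens = Seq("foo", "bar", "baz") zip (1 :: 2 :: 3 :: Nil)

println(letters)
println(implicitParens == explicitParens)
```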

In [None]:
val ds = spark.createDataset(data)

val ds1 = sc.parallelize(data)

U.printClass(ds)
U.printClass(ds1)

val ds2 = sc.parallelize(data1)

ds1.join(ds2).take(5)
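
For pair RDDs, `join` is an inner join on the key, so `baz` (present only in `data`) should drop out and `bar` (twice in `data1`) should appear twice. A local sketch of the same logic using plain Scala collections rather than Spark; the RDD version does not guarantee this ordering.

```scala
val data  = Seq("foo", "bar", "baz") zip (1 :: 2 :: 3 :: Nil)
val data1 = Seq("foo", "bar", "bar") zip (4 :: 5 :: 6 :: Nil)

// Local stand-in for an inner join on the first tuple element.
val joined = for {
  (k1, v1) <- data
  (k2, v2) <- data1
  if k1 == k2
} yield (k1, (v1, v2))

joined.foreach(println)
```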

In [None]:
// Local file URI
// non-existent file loads
// /misc/build/0/prog-scala-2nd-ed-code-examples
val local2 = U.local1(".")

In [None]:
val f1 = "rev-users.csv"
val file = sc.textFile(local2(f1).toString())
U.printClass(file)

In [None]:
// This file has a header row
// Take the first row, index into it, split and return a sequence
val h2 = file.take(1)(0).split(",").toSeq

// Get the remainder by using subtract
// convert the header row back to an RDD using parallelize
val r1 = file.subtract(sc.parallelize(file.take(1)))
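
`subtract` removes every line equal to the header, at the cost of a shuffle. A simpler option may be to filter on the header string. The sketch below shows the idea on a local `Seq` with made-up sample lines; the same predicate should carry over to `file.filter(_ != header)` on the RDD.

```scala
// Made-up sample lines standing in for the CSV file (not the real data).
val lines = Seq("user_id,birth_year", "u1,1980", "u2,1990")

val header  = lines.head                 // first line holds the column names
val columns = header.split(",").toSeq    // column names as a sequence
val rows    = lines.filter(_ != header)  // everything except the header

println(columns)
println(rows)
```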

In [None]:
// Look at the underlying row
r1.take(1)

In [None]:
// Now map over the fields, converting the numeric ones with toInt.
// The transformations are lazy and only run when we call take(); the column names come from h2.
val df0 = r1.map(_.split(",")).map{case Array(a,b,c,d,e,f,g,h,i,j,k,l) => 
(a,b.toInt,c,d,e,f.toInt,g,h,i,j.toInt,k.toInt,l.toInt)}.toDF(h2:_*)
df0.take(1)
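
The pattern `case Array(a, ..., l)` throws a `MatchError` for any line that does not split into exactly twelve fields, and `toInt` throws on non-numeric text. A hedged sketch of a more defensive parse, cut down to two columns for brevity; `UserRow` and `parseLine` are hypothetical names, not part of the notebook's code.

```scala
import scala.util.Try

// Hypothetical two-column record; the real file has twelve columns.
final case class UserRow(userId: String, birthYear: Int)

// Returns None instead of throwing when the field count or the number format is wrong.
def parseLine(line: String): Option[UserRow] =
  line.split(",", -1) match {   // -1 keeps trailing empty fields
    case Array(id, year) => Try(UserRow(id, year.trim.toInt)).toOption
    case _               => None
  }

// Made-up input: one good line, one bad number, one line with too many fields.
val parsed = Seq("u1,1980", "u2,not-a-year", "u3,1990,extra").flatMap(line => parseLine(line))
println(parsed)
```

On the RDD, something like `r1.flatMap(line => parseLine(line))` should skip malformed lines in the same way.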

In [None]:
val f2 = "rev-devices.csv"
val file2 = sc.textFile(local2(f2).toString())
U.printClass(file2)

In [None]:
// But error results here if file does not exist
// Or returns empty array if it is empty
val lens = file.map(s => s.length)
file.take(5)
lens.take(5)

In [None]:
val x0 = file.take(1)

// Some arbitrary file processing - append a number to each line
val pairs = file.map(s => (s, 911))
val counts = pairs.reduceByKey((a, b) => a + b)
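
`reduceByKey` merges the values of identical keys with the given function. Since every line is paired with 911, a line that occurs n times should end up with 911 × n. A local sketch of that behaviour with `groupBy`, using made-up lines rather than the real file:

```scala
// Made-up lines; "a" occurs twice.
val lines = Seq("a", "b", "a")
val pairs = lines.map(s => (s, 911))

// Local equivalent of pairs.reduceByKey((a, b) => a + b)
val counts: Map[String, Int] =
  pairs.groupBy(_._1).map { case (key, kvs) => key -> kvs.map(_._2).reduce(_ + _) }

println(counts)
```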

In [None]:
val counts1 = counts.repartition(1)

U.rmdir("counts1")
counts1.saveAsTextFile(local2("counts1").toString())

In [None]:
val pairs = file.map(x => (x.split(",")(0), x))

val pairs1 = pairs.join(pairs)

In [None]:
// Make some (K, V) tuples

println(x0(0))

val x1 = x0(0).split(",").toSeq

In [None]:
val df0 = file.map(_.split(",")).map{case Array(a,b,c,d,e,f,g,h,i,j,k,l) => 
(a,b,c,d,e,f,g,h,i,j,k,l)}.toDF(x1:_*)

In [None]:
// The x1:_* is to be preferred to this

// val fileToDf = file.map(_.split(",")).map{case Array(a,b,c,d,e,f,g,h,i,j,k,l) => 
// (a,b,c,d,e,f,g,h,i,j,k,l)}.toDF("user_id", "birth_year", "country", "city", "created_date", "user_settings_crypto_unlocked", "plan", "attributes_notifications_marketing_push", "attributes_notifications_marketing_email", "num_contacts", "num_referrals", "num_successful_referrals")

In [None]:
val df0 = file.map(_.split(",")).map{case Array(a,b,c,d,e,f,g,h,i,j,k,l) => 
(a,b.toInt,c,d,e,f,g,h,i,j,k,l)}.toDF(x1:_*)

In [None]:
// fileToDf is commented out above, so show the equivalent df0 instead
df0.show(3)

In [None]:
file.map(_.split(",")).take(1)

In [None]:
val df1 = file.subtract(sc.parallelize(file.take(1)))

In [None]:
U.printClass(sc)

In [None]:
df1.take(1)

In [None]:
// Load a text file as an RDD of lines. Note that `sep` is accepted but not used yet.
def split(f1: String, sep: String)(implicit sc: org.apache.spark.SparkContext): org.apache.spark.rdd.RDD[String] =
    sc.textFile(f1)

In [None]:
split(local2(f1).toString(), ",")(sc)
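
Because `sc` appears to be an ordinary `val` rather than an implicit one, it has to be passed explicitly in the second parameter list, as above. A minimal sketch of the difference, using a hypothetical `Ctx` class as a stand-in for `SparkContext` so that it runs without Spark:

```scala
// Hypothetical stand-in for SparkContext.
final case class Ctx(name: String)

def load(path: String)(implicit ctx: Ctx): String = s"${ctx.name} reads $path"

// 1. Pass the context explicitly, as the notebook does.
val passedExplicitly = load("rev-users.csv")(Ctx("sc"))

// 2. Mark a value implicit and let the compiler supply it.
implicit val defaultCtx: Ctx = Ctx("sc")
val passedImplicitly = load("rev-users.csv")

println(passedExplicitly == passedImplicitly)
```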

In [None]:
U.printClass(sc)

## MLLib

This uses LDA, following the example linked below. The data chosen is not very good: the headlines are too short to give the model enough words to work with.

https://medium.com/analytics-vidhya/distributed-topic-modelling-using-spark-nlp-and-spark-mllib-lda-6db3f06a4da3

In [53]:
val url = "file:///a/l/X-image/cache/data/abcnews-date-text.csv"

url: String = file:///a/l/X-image/cache/data/abcnews-date-text.csv


In [54]:
val type0="csv"
val infer_schema="true"
val first_row_is_header = "true"
val delimiter=","

type0: String = csv
infer_schema: String = true
first_row_is_header: String = true
delimiter: String = ,


In [55]:
val df0 = spark.read.format(type0)
  .option("inferSchema", infer_schema)
  .option("header", first_row_is_header)
  .option("sep", delimiter)
  .load(url)

df0: org.apache.spark.sql.DataFrame = [publish_date: int, headline_text: string]


In [56]:
df0.count()
df0.show()

+------------+--------------------+
|publish_date|       headline_text|
+------------+--------------------+
|    20030219|aba decides again...|
|    20030219|act fire witnesse...|
|    20030219|a g calls for inf...|
|    20030219|air nz staff in a...|
|    20030219|air nz strike to ...|
|    20030219|ambitious olsson ...|
|    20030219|antic delighted w...|
|    20030219|aussie qualifier ...|
|    20030219|aust addresses un...|
|    20030219|australia is lock...|
|    20030219|australia to cont...|
|    20030219|barca take record...|
|    20030219|bathhouse plans m...|
|    20030219|big hopes for lau...|
|    20030219|big plan to boost...|
|    20030219|blizzard buries u...|
|    20030219|brigadier dismiss...|
|    20030219|british combat tr...|
|    20030219|bryant leads lake...|
|    20030219|bushfire victims ...|
+------------+--------------------+
only showing top 20 rows



In [57]:
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.Normalizer
import com.johnsnowlabs.nlp.annotators.StopWordsCleaner
import com.johnsnowlabs.nlp.annotators.Stemmer
import com.johnsnowlabs.nlp.Finisher

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.Normalizer
import com.johnsnowlabs.nlp.annotators.StopWordsCleaner
import com.johnsnowlabs.nlp.annotators.Stemmer
import com.johnsnowlabs.nlp.Finisher


In [58]:
// Convert the raw headline text into Spark NLP document annotations
val document_assembler = new DocumentAssembler().setInputCol("headline_text").setOutputCol("document").setCleanupMode("shrink")

// Split each document into tokens
val tokenizer = new Tokenizer().setInputCols(Array("document")).setOutputCol("token")

// Clean unwanted characters and garbage
val normalizer = new Normalizer().setInputCols(Array("token")).setOutputCol("normalized")

// remove stopwords
val stopwords_cleaner = new StopWordsCleaner().setInputCols("normalized").setOutputCol("cleanTokens").setCaseSensitive(false)

// stem the words to bring them to the root form.
val stemmer = new Stemmer().setInputCols(Array("cleanTokens")).setOutputCol("stem")

// Finisher is the most important annotator. 
// Spark NLP adds its own structure when we convert each row in the dataframe to document. 
// Finisher helps us to bring back the expected structure viz. array of tokens.
val finisher = new Finisher().setInputCols(Array("stem")).setOutputCols(Array("tokens"))
    .setOutputAsArray(true).setCleanAnnotations(false)

document_assembler: com.johnsnowlabs.nlp.DocumentAssembler = document_30ec569ce6fc
tokenizer: com.johnsnowlabs.nlp.annotators.Tokenizer = REGEX_TOKENIZER_1b1a4c767d8a
normalizer: com.johnsnowlabs.nlp.annotators.Normalizer = NORMALIZER_a6cb81d75b4a
stopwords_cleaner: com.johnsnowlabs.nlp.annotators.StopWordsCleaner = STOPWORDS_CLEANER_a8269ae19c1f
stemmer: com.johnsnowlabs.nlp.annotators.Stemmer = STEMMER_c317dffc1e47
finisher: com.johnsnowlabs.nlp.Finisher = finisher_42b56442e7b2


In [59]:
import org.apache.spark.ml.Pipeline

import org.apache.spark.ml.Pipeline


In [60]:
// Build an ML pipeline so that each stage is executed in sequence.
// The same pipeline can also be applied to test data.
val stages=Array(document_assembler, tokenizer, normalizer, stopwords_cleaner, stemmer, finisher)
val nlp_pipeline = new Pipeline().setStages(stages)

stages: Array[org.apache.spark.ml.PipelineStage with org.apache.spark.ml.util.DefaultParamsWritable] = Array(document_30ec569ce6fc, REGEX_TOKENIZER_1b1a4c767d8a, NORMALIZER_a6cb81d75b4a, STOPWORDS_CLEANER_a8269ae19c1f, STEMMER_c317dffc1e47, finisher_42b56442e7b2)
nlp_pipeline: org.apache.spark.ml.Pipeline = pipeline_a5fbc1e948c4


In [61]:
//  apply the pipeline to transform dataframe.
val nlp_model = nlp_pipeline.fit(df0) 
val processed_df0  = nlp_model.transform(df0)

nlp_model: org.apache.spark.ml.PipelineModel = pipeline_a5fbc1e948c4
processed_df0: org.apache.spark.sql.DataFrame = [publish_date: int, headline_text: string ... 6 more fields]


In [64]:
processed_df0.count()

res42: Long = 1082168


In [65]:
// The NLP pipeline creates intermediate columns that we don't need, so select only the columns we need.
val tokens_df0 = processed_df0.select("publish_date","tokens").limit(100000)
tokens_df0.show()

org.apache.spark.SparkException:  Job aborted due to stage failure: Task 0 in stage 40.0 failed 1 times, most recent failure: Lost task 0.0 in stage 40.0 (TID 54) (j1.host0 executor driver): org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (HasSimpleAnnotate$$Lambda$5311/0x0000000801ff7040: (array<array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>>) => array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>).

## Features

This section generates features from the textual data. Latent Dirichlet Allocation requires a data-specific vocabulary to perform topic modeling.

In [20]:
import org.apache.spark.ml.feature.CountVectorizer

import org.apache.spark.ml.feature.CountVectorizer


In [21]:
val cv = new CountVectorizer().setInputCol("tokens").setOutputCol("features").setVocabSize(500).setMinTF(3.0)

// train the model
val cv_model = cv.fit(tokens_df0)
// transform the data. Output column name will be features.
val vectorized_tokens = cv_model.transform(tokens_df0)

cv: org.apache.spark.ml.feature.CountVectorizer = cntVec_3a8a98006498
cv_model: org.apache.spark.ml.feature.CountVectorizerModel = CountVectorizerModel: uid=cntVec_3a8a98006498, vocabularySize=500
vectorized_tokens: org.apache.spark.sql.DataFrame = [publish_date: int, tokens: array<string> ... 1 more field]
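
The empty `(500,[],[])` vectors in the output that follows are consistent with the `setMinTF(3.0)` setting: for values of 1 or more, `minTF` is a per-document count threshold, so a term is kept only if it occurs at least three times within a single headline. A plain-Scala sketch of that filter (not Spark's implementation), using the tokens of one headline from this dataset; `keepTerms` is a hypothetical helper.

```scala
// Tokens of one stemmed headline, as shown in this notebook.
val tokens = Seq("air", "nz", "strike", "affect", "australian", "travel")

// Term frequency within this single document.
val termCounts: Map[String, Int] =
  tokens.groupBy(identity).map { case (term, occurrences) => term -> occurrences.size }

// Hypothetical helper mimicking a count-based minTF threshold.
def keepTerms(counts: Map[String, Int], minTF: Int): Map[String, Int] =
  counts.filter { case (_, count) => count >= minTF }

println(keepTerms(termCounts, 3).size)  // terms surviving a threshold of three
println(keepTerms(termCounts, 1).size)  // terms surviving a threshold of one
```

With short headlines, hardly any term repeats three times, so a lower threshold such as `setMinTF(1.0)` would be the obvious thing to try.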


In [22]:
vectorized_tokens.show()

+------------+--------------------+-----------+
|publish_date|              tokens|   features|
+------------+--------------------+-----------+
|    20030219|[aba, decid, comm...|(500,[],[])|
|    20030219|[act, fire, wit, ...|(500,[],[])|
|    20030219|[g, call, infrast...|(500,[],[])|
|    20030219|[air, nz, staff, ...|(500,[],[])|
|    20030219|[air, nz, strike,...|(500,[],[])|
|    20030219|[ambiti, olsson, ...|(500,[],[])|
|    20030219|[antic, delight, ...|(500,[],[])|
|    20030219|[aussi, qualifi, ...|(500,[],[])|
|    20030219|[aust, address, u...|(500,[],[])|
|    20030219|[australia, lock,...|(500,[],[])|
|    20030219|[australia, contr...|(500,[],[])|
|    20030219|[barca, take, rec...|(500,[],[])|
|    20030219|[bathhous, plan, ...|(500,[],[])|
|    20030219|[big, hope, launc...|(500,[],[])|
|    20030219|[big, plan, boost...|(500,[],[])|
|    20030219|[blizzard, buri, ...|(500,[],[])|
|    20030219|[brigadi, dismiss...|(500,[],[])|
|    20030219|[british, combat,...|(500,

## Build Model


In [28]:
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.sql.Row

import org.apache.spark.ml.clustering.LDA
import org.apache.spark.sql.Row


In [29]:
val num_topics = 5
val lda = new LDA().setK(num_topics).setMaxIter(100)
val model = lda.fit(vectorized_tokens)

val ll = model.logLikelihood(vectorized_tokens)
val lp = model.logPerplexity(vectorized_tokens)

num_topics: Int = 5
lda: org.apache.spark.ml.clustering.LDA = lda_ebc947d82821
model: org.apache.spark.ml.clustering.LDAModel = LocalLDAModel: uid=lda_ebc947d82821, k=5, numFeatures=500
ll: Double = -991.6942161586913
lp: Double = 110.18824623985459


In [248]:
println("The lower bound on the log likelihood of the entire corpus: " + ll.toString())
println("The upper bound on perplexity: " + lp.toString())

The lower bound on the log likelihood of the entire corpus: -991.6942161586913
The upper bound on perplexity: 110.18824623985459
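
Assuming `logPerplexity` is computed as the negated `logLikelihood` divided by the total token count of the corpus (which is how Spark's local LDA model appears to define it), the two numbers above imply how many token occurrences the model actually saw:

```scala
// Values printed above.
val ll = -991.6942161586913
val lp = 110.18824623985459

// Assumed relation: lp = -ll / tokenCount, hence tokenCount = -ll / lp.
val impliedTokenCount = -ll / lp
println(impliedTokenCount)
```

The result is very close to 9, which fits with the almost entirely empty feature vectors: only a handful of token occurrences survive the vectorizer, so the model has very little to learn from.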


In [249]:
val topics = model.describeTopics(num_topics)
println("The topics described by their top-weighted terms:")
topics.show(false)

The topics described by their top-weighted terms:
+-----+-------------------------+------------------------------------------------------------------------------------------------------------------+
|topic|termIndices              |termWeights                                                                                                       |
+-----+-------------------------+------------------------------------------------------------------------------------------------------------------+
|0    |[300, 69, 359, 493, 286] |[0.002655236612160945, 0.002497961414830468, 0.0024899769599476575, 0.0024601846493057153, 0.00244648449557187]   |
|1    |[239, 369, 201, 329, 98] |[0.0025134790115381414, 0.0025013211527450227, 0.0024899204926820107, 0.0024621807786108577, 0.002460169764610191]|
|2    |[319, 298, 177, 155, 401]|[0.015238049286679918, 0.0025746153634851575, 0.002442532964731303, 0.0024237449892337945, 0.0024162537292844602] |
|3    |[261, 491, 145, 248, 316]|[0.00253436396446723, 0

topics: org.apache.spark.sql.DataFrame = [topic: int, termIndices: array<int> ... 1 more field]


In [42]:
val transformed = model.transform(vectorized_tokens)
val tr0 = transformed.select("topicDistribution").collect()
transformed.show(false)

+------------+---------------------------------------------------+-----------+---------------------+
|publish_date|tokens                                             |features   |topicDistribution    |
+------------+---------------------------------------------------+-----------+---------------------+
|20030219    |[aba, decid, commun, broadcast, licenc]            |(500,[],[])|[0.0,0.0,0.0,0.0,0.0]|
|20030219    |[act, fire, wit, must, awar, defam]                |(500,[],[])|[0.0,0.0,0.0,0.0,0.0]|
|20030219    |[g, call, infrastructur, protect, summit]          |(500,[],[])|[0.0,0.0,0.0,0.0,0.0]|
|20030219    |[air, nz, staff, aust, strike, pai, rise]          |(500,[],[])|[0.0,0.0,0.0,0.0,0.0]|
|20030219    |[air, nz, strike, affect, australian, travel]      |(500,[],[])|[0.0,0.0,0.0,0.0,0.0]|
|20030219    |[ambiti, olsson, win, tripl, jump]                 |(500,[],[])|[0.0,0.0,0.0,0.0,0.0]|
|20030219    |[antic, delight, record, break, barca]             |(500,[],[])|[0.0,0.0,0.0,

transformed: org.apache.spark.sql.DataFrame = [publish_date: int, tokens: array<string> ... 2 more fields]
tr0: Array[org.apache.spark.sql.Row] = Array([[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0.0]], [[0.0,0.0,0.0,0.0,0....


In [51]:
def adSum(ad: Array[Double]) = {
  var sum = 0.0
  var i = 0
  while (i<ad.length) { sum += ad(i); i += 1 }
  sum
}

adSum: (ad: Array[Double])Double


In [85]:
// do some casting to array
val tr1 = tr0.map(_(0)).map(_.asInstanceOf[org.apache.spark.ml.linalg.DenseVector].toArray)

In [88]:
tr1.map(x => adSum(x) ).filter(_ > 0)

res57: Array[Double] = Array(1.0, 0.9999999999999998, 1.0)


In [37]:
val a = Array.tabulate(100)(_.toDouble)
val ab = new collection.mutable.ArrayBuffer[Double] ++ a
ab

a: Array[Double] = Array(0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0, 98.0, 99.0)
ab: scala.collection.mutable.ArrayBuffer[Double] = ArrayBuffer(0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21....


In [40]:
(0.0 /: a)(_ + _)

res21: Double = 4950.0
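
The `/:` operator is an alias for `foldLeft` and is deprecated from Scala 2.13. A short sketch comparing the hand-written loop with `foldLeft` and the built-in `sum`:

```scala
val a = Array.tabulate(100)(_.toDouble)

// Hand-written loop, as in adSum above.
def adSum(ad: Array[Double]): Double = {
  var sum = 0.0
  var i = 0
  while (i < ad.length) { sum += ad(i); i += 1 }
  sum
}

val viaLoop = adSum(a)
val viaFold = a.foldLeft(0.0)(_ + _)   // replacement for (0.0 /: a)(_ + _)
val viaSum  = a.sum

println(Seq(viaLoop, viaFold, viaSum))
```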


## Visualization

The results from the algorithm need to be restructured.

In [240]:
val vocab = cv_model.vocabulary
val topics = model.describeTopics()
val topics_rdd = topics.rdd

vocab: Array[String] = Array(u, polic, govt, new, plan, man, council, iraq, sai, call, kill, win, charg, court, back, face, claim, urg, fund, report, warn, fire, nsw, boost, take, get, attack, set, death, water, qld, wa, mai, probe, group, consid, seek, crash, hospit, open, continu, world, health, concern, cup, miss, lead, hope, mp, pm, protest, meet, help, servic, test, two, murder, chang, hit, iraqi, australia, minist, ban, talk, drug, war, sydnei, home, vic, secur, sa, rise, support, year, top, industri, road, bomb, work, investig, final, nt, air, car, offer, welcom, return, fear, act, reject, jail, case, hous, forc, job, fight, record, make, strike, end, deal, defend, worker, arrest, pai, trial, australian, power, public, farmer, m, cut, move, rule, still, dai, dead, south, want, tr...


In [241]:
// Define the schema and make a data frame with it

import org.apache.spark.sql.types._

val schema = new StructType()
  .add(StructField("id", IntegerType, false))
  .add(StructField("indices", ArrayType(IntegerType, true)))
  .add(StructField("scores", ArrayType(DoubleType, true)))

import org.apache.spark.sql.types._
schema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,false),StructField(indices,ArrayType(IntegerType,true),true),StructField(scores,ArrayType(DoubleType,true),true))


In [242]:
import spark.implicits._
// import org.apache.spark.sql.functions.explode

val df1 = spark.createDataFrame(topics_rdd, schema)

df1.printSchema()

root
 |-- id: integer (nullable = false)
 |-- indices: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- scores: array (nullable = true)
 |    |-- element: double (containsNull = true)



import spark.implicits._
df1: org.apache.spark.sql.DataFrame = [id: int, indices: array<int> ... 1 more field]


In [243]:
// Using the column names create a case class that has the Arrays in it
case class tab1(id: Int, indices: WrappedArray[Int], scores: WrappedArray[Double])

defined class tab1


In [244]:
// Cast the dataframe to be of that type.
val df2 = df1.as[tab1]
df2.show

+---+--------------------+--------------------+
| id|             indices|              scores|
+---+--------------------+--------------------+
|  0|[300, 69, 359, 49...|[0.00273718541331...|
|  1|[239, 369, 201, 3...|[0.00257811336156...|
|  2|[298, 177, 155, 4...|[0.00268410608851...|
|  3|[261, 491, 145, 2...|[0.00260131780330...|
|  4|[54, 384, 348, 89...|[0.00956404662531...|
+---+--------------------+--------------------+



df2: org.apache.spark.sql.Dataset[tab1] = [id: int, indices: array<int> ... 1 more field]


In [245]:
// Then the columns can be accessed as members and with types.
// this uses the indices into the vocab 

val df3 = df2.map(x => x.indices.map(vocab).zip(x.scores)).collect.toList.map {
    _.map(x => (x._1, x._2))
}

df3: List[scala.collection.mutable.WrappedArray[(String, Double)]] = List(WrappedArray((poll,0.0027371854133140853), (secur,0.002560246657611293), (raid,0.002551263934741385), (line,0.0025177467974942472), (bill,0.002502333762281136), (teacher,0.0024809462044122427), (race,0.002456433101087516), (act,0.0024250307641168), (nt,0.0024169064692088882), (author,0.002416614667940068)), WrappedArray((korea,0.002578113361563147), (nurs,0.002564425666431991), (doctor,0.002551590449154687), (increas,0.002520360217278145), (strike,0.00251809615524107), (poll,0.0025062644280818887), (british,0.0025029408660668896), (kill,0.002490891680355644), (brisban,0.0024788558521510163), (tip,0.002477245307779051)), WrappedArray((storm,0.0026841060885122938), (look,0.002533607378645535), (tour,0.00251219978686...
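
The step above looks up each term index in the vocabulary array and pairs the resulting term with its weight. A small sketch of that lookup with the first five vocabulary entries from this notebook; the indices and weights here are made up for illustration and are not model output.

```scala
// First five vocabulary entries from this notebook.
val vocab = Array("u", "polic", "govt", "new", "plan")

// Made-up topic: term indices and weights (not real model output).
val indices = Seq(4, 1, 2)
val scores  = Seq(0.5, 0.3, 0.2)

// Replace each index by its term and pair it with the weight.
val described: Seq[(String, Double)] = indices.map(i => vocab(i)).zip(scores)

println(described)
```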


In [246]:
df3.map(y => { println("::"); y.map(x => println(x._1 + " :: " + x._2) ) } );

::
poll :: 0.0027371854133140853
secur :: 0.002560246657611293
raid :: 0.002551263934741385
line :: 0.0025177467974942472
bill :: 0.002502333762281136
teacher :: 0.0024809462044122427
race :: 0.002456433101087516
act :: 0.0024250307641168
nt :: 0.0024169064692088882
author :: 0.002416614667940068
::
korea :: 0.002578113361563147
nurs :: 0.002564425666431991
doctor :: 0.002551590449154687
increas :: 0.002520360217278145
strike :: 0.00251809615524107
poll :: 0.0025062644280818887
british :: 0.0025029408660668896
kill :: 0.002490891680355644
brisban :: 0.0024788558521510163
tip :: 0.002477245307779051
::
storm :: 0.0026841060885122938
look :: 0.002533607378645535
tour :: 0.0025121997868690366
women :: 0.002503664017014095
remain :: 0.002479214785423427
top :: 0.0024574859047018613
question :: 0.002451903299087713
rais :: 0.0024474415348164436
g :: 0.0024332839219606646
busi :: 0.0024330620637832406
::
offic :: 0.002601317803300338
post :: 0.0025859822436945655
deni :: 0.002535437793866789

res149: List[scala.collection.mutable.WrappedArray[Unit]] = List(WrappedArray((), (), (), (), (), (), (), (), (), ()), WrappedArray((), (), (), (), (), (), (), (), (), ()), WrappedArray((), (), (), (), (), (), (), (), (), ()), WrappedArray((), (), (), (), (), (), (), (), (), ()), WrappedArray((), (), (), (), (), (), (), (), (), ()))
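
The `res149` value full of `()` appears because `map` collects the `Unit` results of `println`. For printing, `foreach` avoids building that throwaway list. A sketch using the first two terms of the first two topics from the output above:

```scala
// First two terms of the first two topics, copied from the output above.
val topicTerms: List[Seq[(String, Double)]] = List(
  Seq(("poll", 0.0027371854133140853), ("secur", 0.002560246657611293)),
  Seq(("korea", 0.002578113361563147), ("nurs", 0.002564425666431991))
)

// foreach is meant for side effects and returns Unit.
topicTerms.foreach { topic =>
  println("::")
  topic.foreach { case (term, weight) => println(s"$term :: $weight") }
}

// If the text is needed as a value, build it with map and mkString instead.
val rendered: String =
  topicTerms
    .map(_.map { case (term, weight) => s"$term :: $weight" }.mkString("\n"))
    .mkString("\n::\n")
```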


In [105]:
df1.createOrReplaceTempView("topics")

In [110]:
spark.sql("show tables").show
spark.sql("select count(*) from topics").show

+---------+----------+-----------+
|namespace| tableName|isTemporary|
+---------+----------+-----------+
|  default|finaltable|      false|
|         |    topics|       true|
+---------+----------+-----------+

+--------+
|count(1)|
+--------+
|       3|
+--------+

