In [1]:
%use spark

In [19]:
// Read a file. It contains the Kotlin for Data Science landing page.
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD

// The second argument to textFile is the minimum number of partitions.
val textFile: RDD<String> = SparkSession.builder()
    .getOrCreate()
    .sparkContext()
    .textFile("../data/data-science.md", 1)


#### Actions on a single RDD

In [20]:
// line count
textFile.count()

90

In [21]:
// Read first line
textFile.first()

# Kotlin for Data Science

In [22]:
// Show all lines containing "kotlin" (the match is case-sensitive)
textFile.toJavaRDD().filter { it.contains("kotlin") }.collect()

[[Kotlin-jupyter](https://github.com/Kotlin/kotlin-jupyter) is an open source project that brings Kotlin, ![Kotlin in Jupyter notebook]({{ url_for('asset', path='images/landing/data-science/kotlin-jupyter-kernel.png')}}), Check out Kotlin kernel's [GitHub repo](https://github.com/Kotlin/kotlin-jupyter) for installation, ![Kotlin in Zeppelin notebook]({{ url_for('asset', path='images/landing/data-science/kotlin-zeppelin-interpreter.png')}}), * [kotlin-statistics](https://github.com/thomasnield/kotlin-statistics) is a library providing extension functions for, * [Smile](https://github.com/haifengl/smile) - a comprehensive machine learning, natural language processing, linear algebra, graph, interpolation, and visualization system. Besides Java API, Smile also provides a functional [Kotlin API](http://haifengl.github.io/api/kotlin/smile-kotlin/index.html) along with Scala and Clojure API., [**Kotlin Data Science Resources**](https://github.com/thomasnield/kotlin-data-science-resources) di

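Spark loads the file lazily (when the first action is called) and does not keep it in memory. Each time we call an action like the ones above, Spark reads the file again. To avoid this, we call `persist`, which marks the RDD for caching. `persist` is not an action itself: the data is cached the next time an action runs.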

In [23]:
textFile.persist()

../data/data-science.md MapPartitionsRDD[11] at textFile at <unknown>:0
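
The effect of caching can be illustrated without Spark. The sketch below is only a local analogy, not Spark code: a Kotlin `Sequence` re-runs its source on every terminal operation, much as an unpersisted RDD is recomputed on every action, while a materialized `List` is read once. The function `countLoads`, its `cache` parameter, and the sample lines are hypothetical names introduced for illustration.

```kotlin
// Local analogy only (no Spark involved): counts how many times the "source"
// is read when two terminal operations run against it.
fun countLoads(cache: Boolean): Int {
    var loads = 0
    // Lazy source: this block runs again for every terminal operation.
    val lazyLines = sequence {
        loads++ // simulate re-reading the file
        yieldAll(listOf("# Kotlin for Data Science", "", "working with data:"))
    }
    // With cache = true, materialize once, similar in spirit to persisting an RDD.
    val lines: Sequence<String> =
        if (cache) lazyLines.toList().asSequence() else lazyLines
    lines.count() // first "action"
    lines.first() // second "action"
    return loads
}
```

Calling `countLoads(cache = false)` reads the source twice, while `countLoads(cache = true)` reads it once. One difference from Spark: `toList()` materializes immediately, whereas `persist` only takes effect at the next action.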

In [24]:
// Map: convert every line to upper case and take the first five
textFile.toJavaRDD().map { it.toUpperCase() }.take(5)

[# KOTLIN FOR DATA SCIENCE, , FROM BUILDING DATA PIPELINES TO PRODUCTIONIZING MACHINE LEARNING MODELS, KOTLIN CAN BE A GREAT CHOICE FOR, WORKING WITH DATA:, * KOTLIN IS CONCISE, READABLE AND EASY TO LEARN.]

#### Actions on 2 RDDs

In [25]:
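// Collect the lines that mention "kotlin" and "data" to the driver
val kotlinLines = textFile.toJavaRDD().filter { it.contains("kotlin") }.collect()
val dataLines = textFile.toJavaRDD().filter { it.contains("data") }.collect()
// collect() returns plain Lists, so union() here is Kotlin's set union (duplicates removed).
// Take only the first two lines, as the complete result may be large.
dataLines.union(kotlinLines).take(2)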

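// flatMap: split every line into words and take the first ten
textFile.toJavaRDD().flatMap { it.split(" ").iterator() }.take(10)

// Sample the data without replacement. Each line is kept with probability 0.1,
// so the result holds roughly 10% of the dataset.
textFile.toJavaRDD().sample(false, 0.1).collect()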

[Notebooks such as [Jupyter Notebook](https://jupyter.org/) and [Apache Zeppelin](https://zeppelin.apache.org/) provide convenient tools for data visualization and, The ecosystem of libraries for data-related tasks created by the Kotlin community is rapidly expanding., ### Kotlin libraries, streaming operations, a wrapper around [commons-math](http://commons.apache.org/proper/commons-math/) and, Lets-Plot is multiplatform and can be used not only with JVM, but also with JS and Python., , , ]
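
Sampling without replacement keeps each element independently with the given probability, which is why the sample size is only approximately 10% of the input. A minimal local sketch of that idea, using only the Kotlin standard library, is shown here. `sampleLines` and its parameters are hypothetical and are not part of the Spark API.

```kotlin
import kotlin.random.Random

// Bernoulli-style sampling without replacement: each element is kept
// independently with probability `fraction`, so the result size varies
// around fraction * lines.size rather than being fixed.
fun sampleLines(lines: List<String>, fraction: Double, seed: Long): List<String> {
    val rng = Random(seed) // fixed seed makes the sample reproducible
    return lines.filter { rng.nextDouble() < fraction }
}
```

With `fraction = 0.0` nothing is kept and with `fraction = 1.0` every line is kept. Any value in between gives a sample whose size changes with the seed.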

In [26]:
// Sizes of each collected list, their union, the distinct union, and the difference
println(dataLines.count())
println(kotlinLines.count())
println(dataLines.union(kotlinLines).count())
println(dataLines.union(kotlinLines).distinct().count())
println(dataLines.subtract(kotlinLines).count())

18
7
22
22
15
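
The counts above are consistent with set semantics: 18 + 7 = 25, but the union holds 22 lines, so 3 lines contain both words, and 18 - 3 = 15 matches the `subtract` result. This happens because `dataLines` and `kotlinLines` are collected lists, so `union` is Kotlin's `Iterable.union`, which returns a `Set`. Spark's `RDD.union`, by contrast, keeps duplicates. The sketch below shows the difference on plain lists. `unionCounts` is a hypothetical helper, and list concatenation stands in for the duplicate-keeping behaviour.

```kotlin
// Kotlin standard library only. Compares duplicate-keeping concatenation
// with set-based union and subtract on two lists.
fun unionCounts(a: List<String>, b: List<String>): Map<String, Int> = mapOf(
    "concatenated" to (a + b).size,   // keeps duplicates, like RDD.union
    "setUnion" to a.union(b).size,    // Iterable.union returns a Set
    "subtract" to a.subtract(b).size  // elements of a that are not in b
)
```

For two lists that share one line, for example three "data" lines and two "kotlin" lines with one line in common, the concatenation has 5 elements, the set union 4, and the difference 2.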


In [27]:
dataLines.union(kotlinLines).take(2)

[From building data pipelines to productionizing machine learning models, Kotlin can be a great choice for, working with data:]

#### Word count example
This uses simple map and reduce functions to count how often each word occurs in the file.

In [41]:
import scala.Tuple2

// Split each line into words, pair every word with 1, then sum the counts per word
textFile.toJavaRDD()
    .flatMap { it.split(" ").iterator() }
    .mapToPair { word -> Tuple2(word, 1) }
    .reduceByKey { a, b -> a + b }
    .take(5)
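
The flatMap, pair, and reduce-by-key steps can be checked locally on a small in-memory list before running them on an RDD. The sketch below uses only the Kotlin standard library. `wordCount` is a hypothetical helper, and dropping empty strings is an added assumption, since blank lines split into a single empty string.

```kotlin
// Local check of the flatMap -> (word, 1) -> reduce-by-key pipeline.
fun wordCount(lines: List<String>): Map<String, Int> =
    lines
        .flatMap { it.split(" ") }                  // one element per word
        .filter { it.isNotEmpty() }                 // drop the empty string from blank lines
        .map { word -> word to 1 }                  // pair each word with a count of 1
        .groupingBy { it.first }                    // group the pairs by word
        .fold(0) { acc, pair -> acc + pair.second } // sum the counts per word
```

For the input `listOf("kotlin for data", "", "data science")`, the result maps `data` to 2 and every other word to 1.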