# Clustering financial data using DataFrames

### Importing MLlib libraries

In [None]:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import org.apache.spark.sql.SQLContext
%AddJar http://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar --magic
%AddDeps com.databricks spark-csv_2.10 1.3.0 --transitive

### Read data and pre-processing

The data used in this example is stock market data with the ConnorsRSI indicator. ConnorsRSI is a composite indicator made up of RSI_CLOSE_3, PERCENT_RANK_100, and RSI_STREAK_2. We pass these attributes, along with the actual ConnorsRSI (CRSI) and RSI2, into the KMeans algorithm. All of these values are already on a scale from 0 to 100, so no further normalization is applied.
The other columns (ID, LABEL, RTN5, FIVE_DAY_GL, and CLOSE) are not passed into the KMeans algorithm. We use them for further analysis once the instances have been clustered.
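
To make the "composite" relationship concrete, here is a minimal sketch assuming the commonly published ConnorsRSI definition, in which the indicator is the equal-weighted mean of its three components. The notebook itself does not state the formula, and the helper name `connorsRsi` and the input values are hypothetical.

```scala
// Hypothetical helper, assuming ConnorsRSI is the plain average of its three components.
// Each component is expected to lie between 0 and 100, so the result does too.
def connorsRsi(rsiClose3: Double, rsiStreak2: Double, percentRank100: Double): Double =
  (rsiClose3 + rsiStreak2 + percentRank100) / 3.0

// Made-up component values, for illustration only.
val exampleCrsi = connorsRsi(30.0, 60.0, 90.0)
println(exampleCrsi)
```

If the CRSI column in the file follows this definition, it is a linear combination of three of the other features, which is worth keeping in mind when interpreting the cluster centers.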

In [None]:
// load the CSV file, treating the first row as the header and inferring column types
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val allDF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("data/spykmeans.csv")

allDF.show()
allDF.schema

### RDD / Dataframe conversions

In [None]:
// convert to rdd and cache the data

val rowsRDD = allDF.rdd.map(r => (r.getString(0), r.getString(1), r.getDouble(2),
                                  r.getDouble(3), r.getDouble(4), r.getDouble(5), r.getDouble(6),
                                  r.getInt(7), r.getDouble(8), r.getDouble(9) ))

rowsRDD.cache()

// Build the RDD of feature vectors that will be passed to KMeans, and cache it.
// The features are RSI2, RSI_CLOSE_3, PERCENT_RANK_100, RSI_STREAK_2 and CRSI (columns 5 to 9).
// These are the attributes used to assign each instance to a cluster.

val vectors = allDF.rdd.map(r => Vectors.dense( r.getDouble(5),r.getDouble(6),
                                               r.getInt(7), r.getDouble(8), r.getDouble(9) ))

// KMeans makes several passes over its input, so cache it as the comment above intends
vectors.cache()




In [None]:
rowsRDD.take(5).foreach(println)

### Run cluster analysis

In [None]:
import sqlContext.implicits._
// Train a KMeans model with k = 2 clusters and at most 20 iterations

val kMeansModel = KMeans.train(vectors, 2, 20)

//Print the center of each cluster

kMeansModel.clusterCenters.foreach(println)

// Get the prediction from the model with the ID so we can link them back to other information

val predictions = rowsRDD.map{r => (r._1, kMeansModel.predict(Vectors.dense(r._6, r._7, r._8, r._9, r._10) ))}
predictions.take(5).foreach(println)
// convert the rdd to a dataframe
val predDF = predictions.toDF("ID","CLUSTER")
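
The `predict` call above returns the index of the cluster whose center is closest to the given vector. As a rough illustration of that assignment rule, here is a plain-Scala sketch that does not use Spark. It assumes squared Euclidean distance, and the helper names (`squaredDistance`, `nearestCenter`) and the example centers are hypothetical, not values produced by the model.

```scala
// Squared Euclidean distance between two points of equal dimension.
def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "points must have the same dimension")
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
}

// Index of the closest center, analogous to the cluster id returned by predict.
def nearestCenter(centers: Seq[Array[Double]], point: Array[Double]): Int =
  centers.indices.minBy(i => squaredDistance(centers(i), point))

// Made-up centers over the five features (RSI2, RSI_CLOSE_3, PERCENT_RANK_100,
// RSI_STREAK_2, CRSI); these are not the centers learned from spykmeans.csv.
val exampleCenters = Seq(
  Array(20.0, 25.0, 30.0, 20.0, 25.0),
  Array(80.0, 75.0, 70.0, 80.0, 75.0)
)

// A made-up instance with low indicator values lands in the first cluster.
val exampleCluster = nearestCenter(exampleCenters, Array(15.0, 30.0, 40.0, 10.0, 27.0))
println(exampleCluster)
```

Because the features share a 0 to 100 scale, each one contributes comparably to the distance, which is why the notebook can feed them to KMeans without rescaling.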


### Join the dataframes on ID

In [None]:
val t = allDF.join(predDF, "ID")
t.printSchema()

### Review a subset of each cluster

In [None]:
t.filter("CLUSTER = 0").show()
t.filter("CLUSTER = 1").show()

### Get descriptive statistics for each cluster

In [None]:
t.filter("CLUSTER = 0").describe("RTN5","FIVE_DAY_GL","CLOSE","RSI2","RSI_CLOSE_3").show()
t.filter("CLUSTER = 0").describe("PERCENT_RANK_100","RSI_STREAK_2","CRSI","CLUSTER").show()
t.filter("CLUSTER = 1").describe("RTN5","FIVE_DAY_GL","CLOSE","RSI2","RSI_CLOSE_3").show()
t.filter("CLUSTER = 1").describe("PERCENT_RANK_100","RSI_STREAK_2","CRSI","CLUSTER").show()
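
The `describe` calls above report count, mean, standard deviation, min and max one cluster at a time. As a small illustration of the per-cluster comparison being made, the following plain-Scala sketch groups made-up (cluster id, five-day return) pairs and computes the count, mean, min and max for each group. The data and the names `sample` and `summary` are hypothetical and are not taken from the notebook's dataset.

```scala
// Made-up (cluster id, five-day return) pairs; not taken from spykmeans.csv.
val sample = Seq((0, 1.2), (0, -0.4), (1, 0.3), (1, 0.9), (1, -0.6))

// For each cluster id: (count, mean, min, max) of the return values.
val summary: Map[Int, (Int, Double, Double, Double)] =
  sample.groupBy(_._1).map { case (cluster, rows) =>
    val values = rows.map(_._2)
    (cluster, (values.size, values.sum / values.size, values.min, values.max))
  }

// Print one line per cluster, ordered by cluster id.
summary.toSeq.sortBy(_._1).foreach(println)
```

Comparing the mean of a column such as RTN5 between the two clusters is the kind of follow-up analysis the non-feature columns were kept for.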
