## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) is the Databricks File System, which lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can switch languages in a cell with the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
import numpy as np
from math import sqrt
from pyspark.ml.clustering import KMeans, KMeansModel
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import make_blobs  # the samples_generator submodule was removed from newer scikit-learn releases
from pyspark.sql import SQLContext

spark = SparkSession.builder.getOrCreate()
spark

In [3]:
# read every CSV file under /FileStore/tables into a single DataFrame
df1 = spark.read.format("csv")\
       .option("header", "true")\
       .option("inferSchema", "true")\
       .load("dbfs:/FileStore/tables/*.csv")

In [4]:
df1.printSchema()

In [5]:
print('total number of rows and columns in dataframe:')
(df1.count() , len(df1.columns))

In [6]:
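# count the null values in each column
# count() skips nulls, so flag each null row with a constant instead of returning the (null) column value
df7 = df1.select(['model','smart_1_normalized','failure'])
data_agg = df7.agg(*[F.count(F.when(F.isnull(c), 1)).alias(c) for c in df7.columns])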
data_agg.show()
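
Spark's `count` aggregate only counts non-null values, so the expression passed to it has to be non-null on exactly the rows being tallied. The following is a minimal plain-Python sketch of that rule, using `None` for null. The helpers `when` and `sql_count` are hypothetical stand-ins written for illustration, not PySpark functions, and the sample values are made up.

```python
def when(condition, value):
    """Mimic SQL CASE WHEN without an ELSE branch: None when the condition fails."""
    return value if condition else None


def sql_count(values):
    """Mimic SQL COUNT(expr): count only the non-null entries."""
    return sum(1 for v in values if v is not None)


column = [100, None, 117, None, 83]  # made-up sample values, None plays the role of null

# Returning the column value itself yields None on the null rows, so nothing is counted.
count_returning_value = sql_count(when(v is None, v) for v in column)

# Returning a constant marks every null row with a non-null flag that can be counted.
count_returning_flag = sql_count(when(v is None, 1) for v in column)

print(count_returning_value, count_returning_flag)
```

Under this rule, a null tally needs a constant (or a cast boolean that is summed) as the flagged value.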

In [7]:
df4 = df1.select('serial_number','smart_1_normalized')

In [8]:
# replace null values in numeric columns with zero
df4 = df4.fillna(0)
df4.show()

In [9]:
# use VectorAssembler to build the feature column that k-means will consume
from pyspark.ml.feature import VectorAssembler,VectorIndexer
assembler = VectorAssembler(inputCols=['smart_1_normalized'], outputCol="smart_vec")
# nulls were already replaced above; this step only adds the 'smart_vec' vector column
df3_algo = assembler.transform(df4)

In [10]:
df3_algo.show()

In [11]:
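# Reference: https://rsandstroem.github.io/sparkkmeans.html
# Elbow search over k, left commented out because it took too long to run.
# Note: KMeansModel.computeCost was removed in Spark 3.0 (its deprecation notice points to ClusteringEvaluator).
#cost = np.zeros(10)
#for k in range(2,8):
#    kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("smart_vec")
#    model = kmeans.fit(df3_algo.sample(False,0.1, seed=25))
#    cost[k] = model.computeCost(df3_algo)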
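
The commented-out loop above follows the elbow approach: fit k-means for several values of k and look at how the within-cluster squared error falls. The idea can be sketched cheaply in plain Python on a small synthetic sample. Everything here is an assumption for illustration: `kmeans_1d` is a hypothetical helper, not the Spark implementation, and the sample is randomly generated rather than taken from the dataset.

```python
import random


def kmeans_1d(values, k, iterations=20, seed=1):
    """Lloyd's algorithm on 1-D data; returns (centers, within-cluster squared error)."""
    rng = random.Random(seed)
    # start from k distinct data points
    centers = rng.sample(sorted(set(values)), k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # move each center to the mean of its cluster; keep it in place if the cluster is empty
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    cost = sum(min((v - c) ** 2 for c in centers) for v in values)
    return centers, cost


# synthetic 1-D sample with three well-separated groups (made-up means)
rng = random.Random(25)
sample = [rng.gauss(mu, 2.0) for mu in (75, 100, 120) for _ in range(50)]

costs = {k: kmeans_1d(sample, k)[1] for k in range(1, 7)}
for k, cost in costs.items():
    print(k, round(cost, 1))
```

The k after which the cost stops dropping sharply is the usual candidate. On the real data, running such a search on a sample first keeps the cost of each fit down.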

In [12]:
# k = 5 is chosen directly here, because the elbow search above took too long to run on Databricks
# Reference: https://rsandstroem.github.io/sparkkmeans.html
k = 5
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("smart_vec")
model1 = kmeans.fit(df3_algo)
centers1 = model1.clusterCenters()

print("Cluster Centers: ")
for center in centers1:
    print(center)

In [13]:
#adding values of clusters to dataframe
k_transform = model1.transform(df3_algo)
k_transform.show(10)

In [14]:
k_transform.select('prediction').describe().show()

In [15]:
#taking approximate values of centers
c1 = 100
c2 = 74
c3 = 200
c4 = 116
c5 = 81
centers_01 = np.array([[c1,0], [c2,1], [c3,2],[c4,3],[c5,4]])
centers2 = spark.sparkContext.parallelize(centers_01)  # same as sc.parallelize, without relying on a predefined sc
centers3 = centers2.map(lambda x: [int(i) for i in x])
df3_centers = centers3.toDF(["center","prediction"])
df3_centers.show()
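
The approximate centers above are typed in by hand, which goes stale as soon as the model is refit. A possible alternative is to derive the `(center, prediction)` rows from the list of centers itself. The sketch below is plain Python and assumes that each center's position in the list is its cluster id. `center_rows` is a hypothetical helper, and `example_centers` holds made-up values standing in for `model1.clusterCenters()`.

```python
def center_rows(centers):
    """Turn a list of 1-D cluster centers into (center, prediction) rows.

    Assumes the position of each center in the list is its cluster id.
    """
    return [(int(round(float(c[0]))), idx) for idx, c in enumerate(centers)]


# made-up values standing in for model1.clusterCenters()
example_centers = [[100.2], [73.9], [199.8], [116.4], [80.7]]
rows = center_rows(example_centers)
print(rows)
```

The resulting list of tuples could then be handed to `spark.createDataFrame(rows, ["center", "prediction"])` to build the lookup DataFrame.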

In [16]:
# join each row to its cluster center on 'prediction'; the join groups rows by cluster, so the first rows displayed look alike
k_means1 = k_transform.join(df3_centers, 'prediction')
k_means1.show()

In [17]:
k_means1.printSchema()

In [18]:
k_means1.select('prediction').distinct().show()

In [19]:
# add Min and Max columns at center - 100 and center + 100
k_means2 = k_means1.withColumn("Min",k_means1["center"]-100)
k_means2 = k_means2.withColumn("Max",k_means2["center"]+100)

In [20]:
# Euclidean distance of each value from its centroid: sqrt((center - value)^2)
k_means3 = k_means2.withColumn("euclidean_dist",((k_means2["center"]-k_means2['smart_1_normalized'])**2)**0.5)

In [21]:
k_means3.show()
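
With a single feature, the Euclidean distance between a value and its centroid is the square root of the squared difference, which is the same as the absolute difference. A minimal plain-Python sketch with made-up sample values follows; `euclidean_dist_1d` is a hypothetical helper and not part of the notebook.

```python
from math import sqrt


def euclidean_dist_1d(value, center):
    """Euclidean distance in one dimension: sqrt((center - value)^2) == |center - value|."""
    return sqrt((center - value) ** 2)


# made-up (value, center) pairs
print(euclidean_dist_1d(117, 116))
print(euclidean_dist_1d(83, 100))
```

Squaring the difference before taking the root keeps the argument non-negative, so the result is defined whether the value lies above or below its center.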

In [23]:
#taking columns to display a scatter plot 
xx = k_means3.select('prediction')
yy = k_means3.select('smart_1_normalized')
xx = xx.toPandas()
yy = yy.toPandas()
xx = xx.values
yy = yy.values

In [24]:
%matplotlib inline

plt.scatter(xx,yy)