## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html), the Databricks File System, lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can run an individual cell in another language by starting it with the `%LANGUAGE` magic command, for example `%sql`. Python, Scala, SQL, and R are all supported.

In [2]:
from pyspark.sql import SQLContext
from pyspark.sql import DataFrameNaFunctions

In [3]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import Binarizer
from pyspark.ml.feature import VectorAssembler, StringIndexer, VectorIndexer

In [4]:
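sqlContext = SQLContext(sc)  # `sc` is the SparkContext the notebook environment provides
# Read the uploaded CSV, treating the first row as a header and inferring column types.
df = sqlContext.read.load('/FileStore/tables/daily_weather.csv',
                          format='com.databricks.spark.csv', header='true', inferSchema='true')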

In [5]:
df.columns

In [6]:
featureColumns = ['air_pressure_9am','air_temp_9am','avg_wind_direction_9am','avg_wind_speed_9am',
        'max_wind_direction_9am','max_wind_speed_9am','rain_accumulation_9am',
        'rain_duration_9am']

In [7]:
df = df.drop('number')

In [8]:
df = df.na.drop()

In [9]:
df.count(), len(df.columns)

In [10]:
# label = 1.0 when relative_humidity_3pm is strictly above the threshold, otherwise 0.0
binarizer = Binarizer(threshold=24.99999, inputCol="relative_humidity_3pm", outputCol="label")
binarizedDF = binarizer.transform(df)
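
The thresholding rule that `Binarizer` applies can be illustrated without Spark. The sketch below is plain Python, not the Spark implementation; the helper name `binarize` and the sample humidity readings are made up for illustration. It follows the documented behavior that values strictly greater than the threshold map to 1.0 and all other values map to 0.0.

```python
def binarize(values, threshold):
    """Map each value to 1.0 if it is strictly greater than threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


# Hypothetical relative_humidity_3pm readings (not taken from the dataset).
sample_humidity = [12.5, 24.99999, 25.0, 60.3]
labels = binarize(sample_humidity, threshold=24.99999)
print(labels)  # [0.0, 0.0, 1.0, 1.0]
```

A reading exactly equal to the threshold gets label 0.0, which is presumably why the threshold is set just below 25: a reading of exactly 25 is then labeled 1.0.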

In [11]:
binarizedDF.describe()

In [12]:
binarizedDF.select("relative_humidity_3pm","label").show(4)

In [13]:
assembler = VectorAssembler(inputCols=featureColumns,outputCol="features")
assembled = assembler.transform(binarizedDF)
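
Conceptually, `VectorAssembler` concatenates the listed input columns, in order, into a single feature vector per row. The following plain-Python sketch shows that idea on one row; the helper name `assemble` and the row values are hypothetical, and a real Spark vector type is replaced by a list of floats.

```python
def assemble(row, input_cols):
    """Concatenate the values of input_cols, in the given order, into one feature list."""
    return [float(row[col]) for col in input_cols]


# Hypothetical row; the values are illustrative only.
row = {
    "air_pressure_9am": 918.06,
    "air_temp_9am": 74.82,
    "relative_humidity_3pm": 36.16,
}
features = assemble(row, ["air_pressure_9am", "air_temp_9am"])
print(features)  # [918.06, 74.82]
```

Columns that are not listed, such as the label source `relative_humidity_3pm`, stay out of the feature vector.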

In [14]:
assembled.printSchema()

In [15]:
# 80/20 train/test split with a fixed seed for repeatability
trainingData, testData = assembled.randomSplit([0.8, 0.2], seed=13234)
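
A weighted random split assigns each row to a split independently, so the resulting sizes only approximate the requested 80/20 ratio. The sketch below illustrates that idea in plain Python; it is not Spark's algorithm, and the function name `random_split` is an assumption made for this example.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split, with probability proportional to its weight."""
    total = float(sum(weights))
    # Cumulative upper bounds in [0, 1], e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)  # seeded generator, so the split is reproducible
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding slack.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, test = random_split(range(1000), [0.8, 0.2], seed=13234)
print(len(train), len(test))
```

Every row lands in exactly one split, and rerunning with the same seed yields the same assignment.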

In [16]:
trainingData.count(),testData.count()

In [17]:
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features",
                            maxDepth=5, minInstancesPerNode=20, impurity="gini")

In [18]:
pipeline = Pipeline(stages=[dt])
model = pipeline.fit(trainingData)

In [19]:
predictions = model.transform(testData)

In [20]:
predictions.select("prediction","label").show(20)

In [22]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics

In [23]:
df.show(10)

In [24]:
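# Replace the in-memory predictions with ones read back from a CSV file
# (the step that wrote this file is not shown in this notebook).
predictions = sqlContext.read.load(
    '/FileStore/tables/predictions.csv/part-00000-tid-7491931305090165215-8c5321c1-54a4-4785-aeae-2ed1d057cb4e-88-c000.csv',
    format='com.databricks.spark.csv', header='true', inferSchema='true')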

In [25]:
evaluator = MulticlassClassificationEvaluator(labelCol="label",predictionCol="prediction",metricName="accuracy")

In [26]:
accuracy = evaluator.evaluate(predictions)

In [27]:
predictions.show(10)

In [28]:
print("Accuracy = %g " % (accuracy))
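
The accuracy metric is the fraction of rows whose prediction equals the label. As a minimal sketch of that arithmetic, assuming the data is available as plain `(prediction, label)` pairs, the helper below (a hypothetical name, not a Spark API) computes it on toy values that are not results from this notebook.

```python
def accuracy_from_pairs(pairs):
    """Return the share of (prediction, label) pairs in which both values match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (prediction, label) pair")
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / len(pairs)


# Toy pairs for illustration only.
toy_pairs = [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0)]
toy_accuracy = accuracy_from_pairs(toy_pairs)
print("Accuracy = %g" % toy_accuracy)  # Accuracy = 0.75
```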

In [29]:
predictions.rdd.take(2)

In [30]:
predictions.rdd.map(tuple).take(2)

In [31]:
metrics = MulticlassMetrics(predictions.rdd.map(tuple))

In [32]:
metrics.confusionMatrix().toArray().transpose()

In [33]:
print("Error = %g " % (1.0 - accuracy))

In [34]:
print("Accuracy = %.2f%%" % (accuracy * 100))

In [35]:
metrics.confusionMatrix().toArray()
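
The two confusion-matrix cells differ only in orientation. `MulticlassMetrics.confusionMatrix()` puts predicted classes in columns, ordered by ascending label, so rows correspond to actual classes; the transposed version has rows for predicted classes. The plain-Python sketch below shows both layouts on toy `(prediction, label)` pairs; the helper names and the data are assumptions for illustration, not output from this notebook.

```python
def confusion_matrix(pairs, classes=(0.0, 1.0)):
    """Count (prediction, label) pairs; rows = actual class, columns = predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for prediction, label in pairs:
        matrix[index[label]][index[prediction]] += 1
    return matrix


def transpose(matrix):
    """Swap rows and columns, so rows = predicted class, columns = actual class."""
    return [list(column) for column in zip(*matrix)]


# Toy pairs for illustration only.
toy_pairs = [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
cm = confusion_matrix(toy_pairs)
print(cm)             # [[1, 1], [2, 2]]
print(transpose(cm))  # [[1, 2], [1, 2]]
```

The diagonal holds the correct predictions in either orientation, so the accuracy is the diagonal sum divided by the total count.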