## Overview
This notebook shows how to create and query a DataFrame from a file that you uploaded to DBFS, and then uses that DataFrame to train a linear regression model with Spark ML. DBFS (the Databricks File System) lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

In [None]:
# Read the CSV file into a Spark DataFrame

# File location
file_location = "/FileStore/tables/tips.csv"

# header=True uses the first line as column names;
# inferSchema=True lets Spark infer each column's data type.
df = spark.read.csv(file_location, header=True, inferSchema=True)
df.show()

In [None]:
# Print the DataFrame's schema
df.printSchema()

# List the DataFrame's columns
df.columns

In [None]:
### Handling Categorical Features
from pyspark.ml.feature import StringIndexer

# Index a single categorical column
indexer = StringIndexer(inputCol="sex", outputCol="sex_indexed")
df_r = indexer.fit(df).transform(df)
df_r.show()

# Index several categorical columns at once
# (inputCols/outputCols require Spark 3.0 or later)
indexer = StringIndexer(
    inputCols=["smoker", "day", "time"],
    outputCols=["smoker_indexed", "day_indexed", "time_index"],
)
df_r = indexer.fit(df_r).transform(df_r)
df_r.show()
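
To make the indexing step concrete, here is a minimal plain-Python sketch of the ordering that `StringIndexer` applies by default (`stringOrderType="frequencyDesc"`): the most frequent label receives index 0.0, the next most frequent 1.0, and so on. The helper name `index_labels` and the sample values are hypothetical, and the alphabetical tie-breaking is assumed to match Spark 3.0 and later. This illustrates the idea only; it is not Spark's implementation.

```python
from collections import Counter


def index_labels(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample of a categorical column such as "day".
days = ["Sun", "Sat", "Sat", "Thur", "Sat", "Sun"]
mapping = index_labels(days)
indexed = [mapping[d] for d in days]
print(mapping)
print(indexed)
```

Because the indices depend on label frequencies, the same label can receive a different index on a different dataset, so the fitted indexer should be reused when transforming new data.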

In [None]:
df_r.columns

In [None]:
# Assemble the independent features into a single vector column
from pyspark.ml.feature import VectorAssembler

featureassembler = VectorAssembler(
    inputCols=["tip", "size", "sex_indexed",
               "smoker_indexed", "day_indexed", "time_index"],
    outputCol="Independent Features",
)
output = featureassembler.transform(df_r)

output.select("Independent Features").show()
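
Conceptually, `VectorAssembler` concatenates the listed input columns, in order, into one numeric vector per row. The plain-Python sketch below mimics that for a single row; the helper name `assemble` and the row values are hypothetical and chosen only for illustration.

```python
def assemble(row, input_cols):
    """Concatenate the named numeric fields of a row into one feature list."""
    return [float(row[col]) for col in input_cols]


input_cols = ["tip", "size", "sex_indexed",
              "smoker_indexed", "day_indexed", "time_index"]

# Hypothetical row after string indexing; the values are made up.
row = {
    "total_bill": 20.0, "tip": 3.0, "size": 2,
    "sex_indexed": 1.0, "smoker_indexed": 0.0,
    "day_indexed": 1.0, "time_index": 0.0,
}

features = assemble(row, input_cols)
print(features)
```

The label column (`total_bill`) is deliberately left out of the feature list, since it is the value the model is trained to predict.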

In [None]:
output.show()

In [None]:
finalized_data = output.select("Independent Features", "total_bill")
finalized_data.show()

In [None]:
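## Train/test split and model training
from pyspark.ml.regression import LinearRegression

# Randomly split the rows: roughly 75% for training, 25% for testing
train_data, test_data = finalized_data.randomSplit([0.75, 0.25])

regressor = LinearRegression(featuresCol="Independent Features",
                             labelCol="total_bill")
# fit() returns a fitted model, which replaces the estimator in `regressor`
regressor = regressor.fit(train_data)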
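
A weighted random split assigns each row to a partition independently, so the resulting sizes only approximate the requested 75/25 ratio and change from run to run unless a seed is fixed. The plain-Python sketch below illustrates that behavior; `random_split` is a hypothetical helper, not Spark's implementation. In PySpark itself, `randomSplit` accepts an optional `seed` argument for the same purpose.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on the interval [0, 1).
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point remainder.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))
train, test = random_split(rows, [0.75, 0.25], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split, and therefore the reported metrics, reproducible across runs.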

In [None]:
# View the model's coefficients and intercept
# (print both: a notebook cell displays only its last expression)
print(regressor.coefficients)
print(regressor.intercept)
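
The coefficients and intercept fully describe a linear model: a prediction is the dot product of the coefficients with the feature vector, plus the intercept. The sketch below shows that arithmetic in plain Python; the helper name `predict` and all numbers are hypothetical and are not taken from the fitted model.

```python
def predict(coefficients, intercept, features):
    """Return intercept + sum(coefficient_i * feature_i)."""
    return sum(c * x for c, x in zip(coefficients, features)) + intercept


# Hypothetical two-feature model, for illustration only.
coefficients = [3.0, 2.0]
intercept = 1.0
features = [2.0, 4.0]

prediction = predict(coefficients, intercept, features)
print(prediction)  # 3.0*2.0 + 2.0*4.0 + 1.0 = 15.0
```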

In [None]:
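### Predictions
# evaluate() scores the test set and returns a summary holding
# the predictions and the regression metrics
pred_results = regressor.evaluate(test_data)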

In [None]:
## Final comparison
pred_results.predictions.show()

In [None]:
### Performance Metrics
pred_results.r2, pred_results.meanAbsoluteError, pred_results.meanSquaredError
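
For reference, the three metrics can be computed by hand from the true and predicted values. The following plain-Python sketch uses the standard definitions (R² as one minus the ratio of residual to total sum of squares); the helper name `regression_metrics` and the sample values are hypothetical.

```python
def regression_metrics(y_true, y_pred):
    """Return (r2, mean_absolute_error, mean_squared_error)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    mse = ss_res / n
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return r2, mae, mse


# Made-up labels and predictions, for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

r2, mae, mse = regression_metrics(y_true, y_pred)
print(r2, mae, mse)
```

An R² close to 1 means the model explains most of the variance in the label, while the two error metrics are expressed in the label's units (MAE) and squared units (MSE).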