## Overview

This notebook shows how to load a file that you uploaded to DBFS into a DataFrame and work with it. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html), the Databricks File System, lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can switch languages in a cell with the `%LANGUAGE` syntax; Python, Scala, SQL, and R are all supported.

In [None]:
# File location and type
file_location = "/FileStore/tables/tips.csv"
file_type = "csv"

# Read the CSV file: use the first row as column names and let Spark infer the column types.
df = spark.read.csv(file_location, header=True, inferSchema=True)

In [None]:
df.printSchema()

In [None]:
df.show()

In [None]:
## Handling Categorical Features
from pyspark.ml.feature import StringIndexer

In [None]:
indexer = StringIndexer(inputCol='sex', outputCol='sex_indexed')
df_r = indexer.fit(df).transform(df)
df_r.show()
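
To make the `sex_indexed` column easier to read, here is a minimal pure-Python sketch of the mapping `StringIndexer` builds with its default ordering (`frequencyDesc`): the most frequent label gets index `0.0`, the next gets `1.0`, and so on. The helper `index_labels`, the sample values, and the alphabetical tie-break are assumptions for illustration, not Spark code.

```python
from collections import Counter

def index_labels(values):
    """Map each distinct string to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumed rule).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample values, not taken from tips.csv.
sample = ["Male", "Female", "Male", "Male", "Female"]
mapping = index_labels(sample)
indexed = [mapping[v] for v in sample]
print(mapping)
print(indexed)
```

The indices are arbitrary codes derived from label counts, so they can change if the data changes.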

In [None]:
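# Index the remaining categorical columns in one pass.
# The inputCols/outputCols (plural) parameters require Spark 3.0 or later.
categorical_col = ['smoker', 'day', 'time']
indexer = StringIndexer(inputCols=categorical_col, outputCols=[i+'_indexed' for i in categorical_col])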
df_r = indexer.fit(df_r).transform(df_r)
df_r.show()

In [None]:
df_r.columns

In [None]:
from pyspark.ml.feature import VectorAssembler
# Combine the predictor columns into one vector column; the label (total_bill) is left out.
featureassembler = VectorAssembler(
    inputCols=['tip', 'size', 'sex_indexed', 'smoker_indexed', 'day_indexed', 'time_indexed'],
    outputCol="Features")
output = featureassembler.transform(df_r)

In [None]:
output.select('Features').show()
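
Conceptually, `VectorAssembler` concatenates the listed columns of each row, in order, into one numeric vector. The sketch below mimics that for a single row in plain Python; the helper `assemble` and the row values are hypothetical. Spark may display some assembled rows in sparse form when they contain many zeros, which this sketch does not model.

```python
def assemble(row, input_cols):
    """Concatenate the chosen numeric columns of one row into a single list."""
    return [float(row[col]) for col in input_cols]

# Hypothetical row; the values are made up for illustration.
row = {"tip": 3.5, "size": 2, "sex_indexed": 0.0, "smoker_indexed": 1.0,
       "day_indexed": 2.0, "time_indexed": 0.0, "total_bill": 20.0}
input_cols = ["tip", "size", "sex_indexed", "smoker_indexed",
              "day_indexed", "time_indexed"]

features = assemble(row, input_cols)
print(features)
```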

In [None]:
finalized_data = output.select('Features', 'total_bill')
finalized_data.show()

In [None]:
from pyspark.ml.regression import LinearRegression

## Train/Test Split
# The seed value is arbitrary; it only makes the split repeatable across runs.
train_data, test_data = finalized_data.randomSplit([0.75, 0.25], seed=42)

# Training: fit() returns a fitted LinearRegressionModel
regressor = LinearRegression(featuresCol='Features', labelCol='total_bill')
regressor = regressor.fit(train_data)
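
A note on the split used above: `randomSplit` treats the weights as probabilities rather than exact quotas, so the two DataFrames only approximate a 75/25 split, and the result varies between runs unless a `seed` is passed. The following plain-Python sketch illustrates the idea with one independent random draw per row. The function `random_split` and the row count are hypothetical, and Spark's actual sampler differs in detail.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by an independent random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative, normalised boundaries, e.g. [0.75, 1.0] for weights [0.75, 0.25].
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw not claimed earlier.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(200))  # hypothetical row ids
train, test = random_split(rows, [0.75, 0.25], seed=42)
print(len(train), len(test))  # sizes are close to, but not exactly, 150 and 50
```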

In [None]:
regressor.coefficients

In [None]:
regressor.intercept
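
The coefficients and intercept define the fitted line: a prediction is the intercept plus the dot product of the coefficients with the feature vector. Here is a minimal sketch of that arithmetic; the numbers are invented for illustration and are not the values this notebook produces.

```python
def predict(coefficients, intercept, features):
    """Linear model prediction: intercept + sum of coefficient * feature."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Hypothetical model parameters, one coefficient per entry in the feature vector
# (tip, size, sex_indexed, smoker_indexed, day_indexed, time_indexed).
coefficients = [3.0, 2.5, 0.0, 1.0, -0.5, 0.5]
intercept = 1.0
features = [3.5, 2.0, 0.0, 1.0, 2.0, 0.0]

prediction = predict(coefficients, intercept, features)
print(prediction)
```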

In [None]:
## Final Comparison: actual total_bill vs. predicted value on the test set
pred_result = regressor.evaluate(test_data)
pred_result.predictions.show()

In [None]:
## Performance Metrics
pred_result.r2, pred_result.meanAbsoluteError, pred_result.meanSquaredError
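
For reference, the three metrics can be computed by hand from labels and predictions using the standard textbook definitions, which should agree with what the summary reports for a model fitted with an intercept. The sketch below uses made-up numbers; `regression_metrics` is a hypothetical helper, not part of PySpark.

```python
def regression_metrics(labels, predictions):
    """Return (r2, mean absolute error, mean squared error)."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_label = sum(labels) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return r2, mae, mse

# Hypothetical labels and predictions.
labels = [10.0, 20.0, 30.0, 40.0]
predictions = [12.0, 18.0, 33.0, 37.0]

r2, mae, mse = regression_metrics(labels, predictions)
print(r2, mae, mse)
```

An R² close to 1 means the model explains most of the variance in `total_bill`; MAE is in the same units as the bill, while MSE is in squared units.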