### Getting Started with Machine Learning in Apache Spark Using Spark ML (Linear Regression)
Steps to implement a machine learning algorithm that predicts a target variable with Spark ML:
1. Import necessary libraries
2. Create a Spark session
3. Read the data
4. Feature selection (decide which features are relevant)
5. Create a vector assembler
6. Transform the selected data using the vector assembler
7. Split the data into training and testing data sets
8. Create an instance of the model (linear regression)
9. Fit the model to the train data
10. Make predictions on the test data
11. Evaluate the model
12. Print the RMSE
13. Print the coefficients and the intercept of the model
14. Finally, stop the Spark session

#### 1. Import necessary libraries
- SparkSession
- VectorAssembler
- LinearRegression
- RegressionEvaluator

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

#### 2. Start a SparkSession

In [2]:
spark = SparkSession.builder.appName('Machine Learning using Apache Spark').getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/12/08 20:10:54 WARN Utils: Your hostname, omar, resolves to a loopback address: 127.0.1.1; using 192.168.1.4 instead (on interface wlo1)
25/12/08 20:10:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/08 20:10:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


#### 3. Read the data
Load the data from a CSV file, using the first line as the header and letting Spark infer the column types.

In [9]:
df = spark.read.csv('data/mpg.csv', header=True, inferSchema=True)
df.show(5)

+----+---------+-----------+----------+------+----------+----+--------+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year|  Origin|
+----+---------+-----------+----------+------+----------+----+--------+
|15.0|        8|      390.0|       190|  3850|       8.5|  70|American|
|21.0|        6|      199.0|        90|  2648|      15.0|  70|American|
|18.0|        6|      199.0|        97|  2774|      15.5|  70|American|
|16.0|        8|      304.0|       150|  3433|      12.0|  70|American|
|14.0|        8|      455.0|       225|  3086|      10.0|  70|American|
+----+---------+-----------+----------+------+----------+----+--------+
only showing top 5 rows


#### 4. Select relevant columns
Features: Cylinders, Engine Disp, Horsepower, Weight, Accelerate

Target: MPG

In [11]:
# Feature columns first; the target column (MPG) is last
columns = ['Cylinders', 'Engine Disp', 'Horsepower', 'Weight', 'Accelerate', 'MPG']
selected_data = df.select(columns)
selected_data.show(5)

+---------+-----------+----------+------+----------+----+
|Cylinders|Engine Disp|Horsepower|Weight|Accelerate| MPG|
+---------+-----------+----------+------+----------+----+
|        8|      390.0|       190|  3850|       8.5|15.0|
|        6|      199.0|        90|  2648|      15.0|21.0|
|        6|      199.0|        97|  2774|      15.5|18.0|
|        8|      304.0|       150|  3433|      12.0|16.0|
|        8|      455.0|       225|  3086|      10.0|14.0|
+---------+-----------+----------+------+----------+----+
only showing top 5 rows


#### 5. Create a VectorAssembler
**What does the VectorAssembler do?**
- It combines the given input columns into a single column that holds one vector per row.
For example, suppose a dataset has three columns, `column_1`, `column_2`, and `column_3`, whose values in one row are 50, 22, and 30 respectively.

`VectorAssembler(inputCols=['column_1', 'column_2', 'column_3'], outputCol='new_column')`

When applied to the dataset, it adds a new column called `new_column` that contains the values of the three columns as a vector: `new_column` = [50, 22, 30].
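
As a rough illustration of the idea only, the following plain-Python sketch mimics what the assembler does to a single row. The `assemble` helper and the dictionary-based row are hypothetical stand-ins and not part of the Spark API; Spark itself produces a vector column on a DataFrame.

```python
# Plain-Python illustration of the idea behind VectorAssembler.
# `assemble` is a hypothetical helper, not a Spark function.

def assemble(row, input_cols, output_col):
    """Return a copy of `row` with the input columns packed into one list."""
    assembled = dict(row)  # copy, so the original row is left untouched
    assembled[output_col] = [float(row[name]) for name in input_cols]
    return assembled


row = {"column_1": 50, "column_2": 22, "column_3": 30}
result = assemble(row, ["column_1", "column_2", "column_3"], "new_column")
print(result["new_column"])  # [50.0, 22.0, 30.0]
```

The original columns stay in place and the new column is added next to them, which matches the `features` column that appears alongside the inputs later in this notebook.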

In [None]:
# First, create the assembler, giving it the input columns and the name of the output column
assembler = VectorAssembler(inputCols=columns[:-1], outputCol='features')

# Then use the assembler to transform a dataset.
# Make sure the input columns exist in selected_data.
transformed_data = assembler.transform(selected_data)

In [25]:
# Split the data into training (70%) and test (30%) sets; the seed makes the split repeatable
train_data, test_data = transformed_data.randomSplit([0.7, 0.3], seed=123)
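
To make the split step concrete, here is a minimal plain-Python sketch of a weighted random split. It assumes each row is assigned to a split independently by one random draw, which is why the resulting sizes are only approximately 70% and 30%. `random_split` is a hypothetical helper; it does not reproduce Spark's internal sampling or the exact rows Spark picks for a given seed.

```python
import random

# Hypothetical helper illustrating a weighted random split; not Spark's implementation.
def random_split(rows, weights, seed):
    """Assign each row to one split according to the normalized weights."""
    total = sum(weights)
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total  # cumulative upper bound of each split
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point rounding

    rng = random.Random(seed)  # a fixed seed gives a repeatable split
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for index, bound in enumerate(bounds):
            if draw < bound:
                splits[index].append(row)
                break
    return splits


rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=123)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one split, and rerunning with the same seed returns the same assignment.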

In [27]:
# Create a LinearRegression instance
lr = LinearRegression(featuresCol='features', labelCol='MPG')

In [28]:
# Fit the model to the training data
model = lr.fit(train_data)

25/12/08 20:26:55 WARN Instrumentation: [c97ac7a1] regParam is zero, which might cause numerical instability and overfitting.


In [29]:
# Make predictions on the test data; this adds a 'prediction' column
predictions = model.transform(test_data)

In [31]:
predictions.show()

+---------+-----------+----------+------+----------+----+--------------------+------------------+
|Cylinders|Engine Disp|Horsepower|Weight|Accelerate| MPG|            features|        prediction|
+---------+-----------+----------+------+----------+----+--------------------+------------------+
|        3|       70.0|       100|  2420|      12.5|23.7|[3.0,70.0,100.0,2...| 27.47798325483925|
|        4|       71.0|        65|  1836|      21.0|32.0|[4.0,71.0,65.0,18...|31.885279946317613|
|        4|       78.0|        52|  1985|      19.4|32.8|[4.0,78.0,52.0,19...|31.758245176657397|
|        4|       79.0|        67|  1950|      19.0|31.0|[4.0,79.0,67.0,19...|31.214891253338095|
|        4|       79.0|        67|  1963|      15.5|26.0|[4.0,79.0,67.0,19...| 31.23196056704461|
|        4|       85.0|        52|  2035|      22.2|29.0|[4.0,85.0,52.0,20...|31.405272620712147|
|        4|       85.0|        70|  1990|      17.0|32.0|[4.0,85.0,70.0,19...|30.887264695322642|
|        4|       88

In [32]:
# labelCol is the actual value (MPG); predictionCol is the column added by model.transform
evaluator = RegressionEvaluator(labelCol='MPG', predictionCol='prediction', metricName='rmse')
rmse = evaluator.evaluate(predictions)

In [33]:
rmse

4.430141003331048
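
For reference, RMSE is the square root of the mean squared difference between the labels and the predictions. The sketch below computes it by hand in plain Python for the seven complete rows visible in the `predictions.show()` output above, so its result covers only those rows and will differ from the full test-set value. The `rmse_by_hand` helper is a hypothetical name used for illustration.

```python
import math

# Hypothetical helper: RMSE computed directly from its definition.
def rmse_by_hand(labels, predictions):
    """Square root of the mean of the squared errors."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# MPG and prediction values copied from the seven complete rows shown above
mpg = [23.7, 32.0, 32.8, 31.0, 26.0, 29.0, 32.0]
predicted = [
    27.47798325483925,
    31.885279946317613,
    31.758245176657397,
    31.214891253338095,
    31.23196056704461,
    31.405272620712147,
    30.887264695322642,
]

sample_rmse = rmse_by_hand(mpg, predicted)
print(sample_rmse)
```

Because RMSE is expressed in the same unit as the target, the value can be read directly as a typical error in miles per gallon.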

In [37]:
print(f"Coefficients of the regression model: {model.coefficients}")
print(f"Intercept of the regression model: {model.intercept}")

Coefficients of the regression model: [-0.12089485076232993,-0.0036834676727725943,-0.0487344280971569,-0.0051909409150766765,-0.024157584457859215]
Intercept of the regression model: 45.83600017414481
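
A linear regression prediction is the intercept plus the sum of each coefficient multiplied by its feature value, with the coefficients in the same order as the assembler's input columns. As a minimal sketch, the plain-Python code below applies the printed coefficients and intercept to the first row shown in step 3 (Cylinders 8, Engine Disp 390.0, Horsepower 190, Weight 3850, Accelerate 8.5, actual MPG 15.0). The `predict` helper is a hypothetical name, and the output does not tell us whether that row fell into the training or the test split.

```python
# Coefficients and intercept copied from the printed model output above
coefficients = [
    -0.12089485076232993,    # Cylinders
    -0.0036834676727725943,  # Engine Disp
    -0.0487344280971569,     # Horsepower
    -0.0051909409150766765,  # Weight
    -0.024157584457859215,   # Accelerate
]
intercept = 45.83600017414481


# Hypothetical helper: the linear model written out by hand.
def predict(features):
    """Return intercept + sum(coefficient * feature value)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


first_row = [8, 390.0, 190, 3850, 8.5]  # first row of df.show(5); actual MPG is 15.0
predicted_mpg = predict(first_row)
print(predicted_mpg)
```

All five coefficients are negative, so in this fitted model a larger value of any feature lowers the predicted MPG.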


In [38]:
spark.stop()