# Machine Learning with PySpark

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('Machine_Learning').getOrCreate()
spark

In [5]:
train_set = spark.read.csv('data/names_and_ages.csv', header=True, inferSchema=True, sep=";")
train_set.show()

+-------+---+----------+----------+------+--------------------+
|   Name|Age|Experience|Salary_USD|ID job|    Current Position|
+-------+---+----------+----------+------+--------------------+
|  Alice| 25|         2|      2911|     9|    Graphic Designer|
|    Bob| 30|         4|      3443|     2|      Data Scientist|
|Charlie| 22|         7|      7034|     5|   Marketing Manager|
|  David| 35|        12|      9118|     6|   Financial Analyst|
|   Emma| 28|         9|     12455|     7|Human Resources S...|
|  Frank| 40|         1|      8372|     1|   Software Engineer|
|  Grace| 23|         3|      2443|     7|Human Resources S...|
|  Henry| 32|        14|      7750|    10|  Operations Manager|
|  Irene| 27|        25|      3635|     7|Human Resources S...|
|   Jack| 33|         2|     15356|     9|    Graphic Designer|
|  Karen| 26|         3|      1940|     7|Human Resources S...|
|    Leo| 29|         1|      4865|     8|Customer Service ...|
|  Maria| 31|         0|      1883|     

In [6]:
train_set.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary_USD: integer (nullable = true)
 |-- ID job: integer (nullable = true)
 |-- Current Position: string (nullable = true)



In [7]:
train_set.columns

['Name', 'Age', 'Experience', 'Salary_USD', 'ID job', 'Current Position']

**Creating a vector assembler**

[Age, Experience] ----> single vector column ----> independent features

**VectorAssembler** is a transformer that combines a given list of DataFrame columns into a single vector column. Most machine learning algorithms in PySpark expect their input features in this form.

In [23]:
from pyspark.ml.feature import VectorAssembler
featureassembler = VectorAssembler(inputCols=["Age", "Experience"], outputCol="Independent Features")

The **transform()** method of VectorAssembler takes a DataFrame as input and returns a new DataFrame with one additional column (here, `Independent Features`) that holds the assembled feature vector for each row.

In [24]:
output = featureassembler.transform(train_set)

In [25]:
output.show()

+-------+---+----------+----------+------+--------------------+--------------------+
|   Name|Age|Experience|Salary_USD|ID job|    Current Position|Independent Features|
+-------+---+----------+----------+------+--------------------+--------------------+
|  Alice| 25|         2|      2911|     9|    Graphic Designer|          [25.0,2.0]|
|    Bob| 30|         4|      3443|     2|      Data Scientist|          [30.0,4.0]|
|Charlie| 22|         7|      7034|     5|   Marketing Manager|          [22.0,7.0]|
|  David| 35|        12|      9118|     6|   Financial Analyst|         [35.0,12.0]|
|   Emma| 28|         9|     12455|     7|Human Resources S...|          [28.0,9.0]|
|  Frank| 40|         1|      8372|     1|   Software Engineer|          [40.0,1.0]|
|  Grace| 23|         3|      2443|     7|Human Resources S...|          [23.0,3.0]|
|  Henry| 32|        14|      7750|    10|  Operations Manager|         [32.0,14.0]|
|  Irene| 27|        25|      3635|     7|Human Resources S...|  

In [26]:
output.columns

['Name',
 'Age',
 'Experience',
 'Salary_USD',
 'ID job',
 'Current Position',
 'Independent Features']
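
As a rough mental model only (this is not how Spark implements it), the assembly step can be pictured in plain Python: each row's `Age` and `Experience` values are cast to floats and packed into one list. The `assemble` helper and the `rows` list below are hypothetical stand-ins; the three rows are copied from the table above.

```python
# Conceptual sketch of what VectorAssembler produces; not Spark's implementation.
# `assemble` is a hypothetical helper, and `rows` copies three rows from the table above.

def assemble(row, input_cols):
    """Pack the selected columns of one row into a single list of floats."""
    return [float(row[col]) for col in input_cols]

rows = [
    {"Name": "Alice", "Age": 25, "Experience": 2, "Salary_USD": 2911},
    {"Name": "Bob", "Age": 30, "Experience": 4, "Salary_USD": 3443},
    {"Name": "David", "Age": 35, "Experience": 12, "Salary_USD": 9118},
]

# Keep every original column and add the assembled vector as a new one.
assembled = [
    {**row, "Independent Features": assemble(row, ["Age", "Experience"])}
    for row in rows
]

for row in assembled:
    print(row["Name"], row["Independent Features"])
```

As in the Spark output, the original columns are kept and the vector is added alongside them.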

In [29]:
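# Spark resolves column names case-insensitively by default, so "Independent features" matches "Independent Features"
data = output.select("Independent features", "Salary_USD")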

In [30]:
data.show()

+--------------------+----------+
|Independent features|Salary_USD|
+--------------------+----------+
|          [25.0,2.0]|      2911|
|          [30.0,4.0]|      3443|
|          [22.0,7.0]|      7034|
|         [35.0,12.0]|      9118|
|          [28.0,9.0]|     12455|
|          [40.0,1.0]|      8372|
|          [23.0,3.0]|      2443|
|         [32.0,14.0]|      7750|
|         [27.0,25.0]|      3635|
|          [33.0,2.0]|     15356|
|          [26.0,3.0]|      1940|
|          [29.0,1.0]|      4865|
|          [31.0,0.0]|      1883|
|          [37.0,0.0]|      7096|
|          [24.0,3.0]|      6736|
|          [38.0,1.0]|      2120|
|         [21.0,14.0]|     16975|
|          [34.0,3.0]|      2238|
|          [39.0,6.0]|      6936|
|         [36.0,15.0]|      7815|
+--------------------+----------+
only showing top 20 rows



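Next, we split the data into a training set (about 75% of the rows) and a test set (about 25%), then fit a linear regression on the training set.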

In [33]:
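from pyspark.ml.regression import LinearRegression
# Train/test split: roughly 75% of the rows for training, 25% for testing
train_data, test_data = data.randomSplit([0.75, 0.25])
# Linear regression that predicts Salary_USD from the assembled feature vector
regressor = LinearRegression(featuresCol='Independent features', labelCol='Salary_USD')
regressor = regressor.fit(train_data)  # fit() returns the trained model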
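
`randomSplit` assigns rows to the splits at random, so the 75/25 ratio holds only approximately, and the split changes from run to run unless a `seed` is passed (for example `data.randomSplit([0.75, 0.25], seed=42)`, where 42 is an arbitrary value). The plain-Python sketch below illustrates the idea under simplifying assumptions; `random_split` is a hypothetical helper, not Spark's implementation.

```python
import random

def random_split(rows, train_weight=0.75, seed=42):
    """Send each row to train with probability `train_weight`, else to test."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

# Hypothetical stand-in for the `data` DataFrame: 100 numbered rows.
rows = list(range(100))
train_rows, test_rows = random_split(rows)

# The sizes are close to 75 and 25, but not guaranteed to match exactly.
print(len(train_rows), len(test_rows))
```

Because every row lands in exactly one of the two lists, no row is shared between training and test data.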

In [34]:
# Coefficients of the fitted linear regression (one per feature: Age, Experience)
regressor.coefficients

DenseVector([-92.5701, 134.5087])

In [35]:
# Intercept of the fitted linear regression
regressor.intercept

10163.755934052866

In [38]:
# Evaluate the model on the test set; the returned summary holds the predictions and error metrics
pred = regressor.evaluate(test_data)
pred.predictions.show()

+--------------------+----------+------------------+
|Independent features|Salary_USD|        prediction|
+--------------------+----------+------------------+
|         [21.0,14.0]|     16975|10102.905778005199|
|          [24.0,3.0]|      6736| 8345.600400170799|
|          [24.0,8.0]|     14278| 9018.143668785806|
|          [25.0,2.0]|      2911| 8118.521684154004|
|          [26.0,3.0]|      1940| 8160.460275583211|
|          [27.0,0.0]|      6996| 7664.364252120412|
|          [29.0,1.0]|      4865| 7613.732781255824|
|          [29.0,2.0]|      6746| 7748.241434978825|
|         [29.0,19.0]|     11555|10034.888548269848|
|          [30.0,4.0]|      3443| 7924.688680131034|
|          [31.0,8.0]|     12362| 8370.153232729244|
|          [33.0,2.0]|     15356| 7377.961185803647|
|          [33.0,5.0]|     12213| 7781.487146972651|
|          [36.0,7.0]|      1965| 7772.794267537271|
+--------------------+----------+------------------+
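
Each value in the `prediction` column is the intercept plus the dot product of the coefficients with the feature vector. The sketch below rechecks one row by hand, using the coefficients as printed above (they are rounded to four decimals, so the result agrees only to about two decimal places); `predict` is a hypothetical helper.

```python
# Values copied from the notebook output above.
coefficients = [-92.5701, 134.5087]   # rounded as displayed by DenseVector
intercept = 10163.755934052866

def predict(features):
    """Linear model: intercept + sum(coefficient * feature)."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))

manual = predict([25.0, 2.0])      # Age 25, Experience 2
reported = 8118.521684154004       # prediction shown for this row in the table

print(manual, reported)
print(abs(manual - reported) < 0.01)
```

The negative coefficient on `Age` and the positive one on `Experience` mean that this model predicts a lower salary for each extra year of age and a higher one for each extra year of experience.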



In [39]:
# Mean absolute error and mean squared error on the test set
pred.meanAbsoluteError, pred.meanSquaredError

(4128.561729668928, 21975051.169756234)
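
Both metrics can be recomputed from the 14 rows of the prediction table: the mean absolute error averages |actual - predicted|, and the mean squared error averages the squared differences. The sketch below is a plain-Python check, not the Spark implementation. The RMSE line is an addition that does not appear in the notebook; it is included because it expresses the error in USD again.

```python
import math

# (Salary_USD, prediction) pairs copied from the prediction table above.
pairs = [
    (16975, 10102.905778005199), (6736, 8345.600400170799),
    (14278, 9018.143668785806),  (2911, 8118.521684154004),
    (1940, 8160.460275583211),   (6996, 7664.364252120412),
    (4865, 7613.732781255824),   (6746, 7748.241434978825),
    (11555, 10034.888548269848), (3443, 7924.688680131034),
    (12362, 8370.153232729244),  (15356, 7377.961185803647),
    (12213, 7781.487146972651),  (1965, 7772.794267537271),
]

errors = [actual - predicted for actual, predicted in pairs]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e * e for e in errors) / len(errors)    # mean squared error
rmse = math.sqrt(mse)                             # root mean squared error, in USD

print(mae, mse, rmse)
```

An average error of roughly 4,100 USD is large relative to the salaries in the table, which suggests that `Age` and `Experience` alone explain little of the variation in this dataset.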