In [1]:
! pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.2.tar.gz (317.3 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.2-py2.py3-none-any.whl size=317812364 sha256=9e6b183eb11f07028050dd5d3aff311b7fc8813b598ca678145c1a6f2f5edce0
  Stored in directory: /root/.cache/pip/wheels/34/34/bd/03944534c44b677cd5859f248090daa9fb27b3c8f8e5f49574
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.2


# **Importing Necessary Libraries**

In [2]:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


# **Creating The Session**

In [3]:
spark = SparkSession.builder.appName('Predict Salaries').getOrCreate()

spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/23 08:01:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# **Loading The Dataset**

In [4]:
training = spark.read.csv('/kaggle/input/pyspark-predict-salaries/Pyspark _ Predict Salary - Sheet1.csv', inferSchema = True, header = True)

In [5]:
training.show()

+---------+---+----------+------+
|     Name|Age|Experience|Salary|
+---------+---+----------+------+
|     Adam| 32|         8| 17000|
|     John| 31|         7| 16000|
|    Chris| 34|        10| 30000|
|  Charles| 21|         3| 12000|
|     Paul| 24|         5| 15000|
|    David| 25|         5| 15000|
|  Indiana| 43|        12| 32000|
|    Linda| 32|         8| 17000|
|Elizabwth| 20|         2| 11000|
+---------+---+----------+------+



# **Printing Schema**

In [6]:
training.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary: integer (nullable = true)



# **Getting The Column Names**

In [7]:
training.columns

['Name', 'Age', 'Experience', 'Salary']

# **Importing Vector Assembler**

`VectorAssembler` is a **feature transformer** that **combines multiple feature columns** into a **single vector column**.

In [8]:
from pyspark.ml.feature import VectorAssembler as va

In the **code below**, we **combine the Age and Experience columns into a single vector column called Independent Features**. This **Independent Features column will be used to train our model**.

In [9]:
feature_assembler = va(inputCols = ['Age', 'Experience'],outputCol = 'Independent Features')


In [10]:
output = feature_assembler.transform(training)

In the **line of code above**, the **transform method** returns a **new DataFrame with an additional column** called **Independent Features**, which holds the **combined feature vector** for each row.

In [11]:
output.show()

+---------+---+----------+------+--------------------+
|     Name|Age|Experience|Salary|Independent Features|
+---------+---+----------+------+--------------------+
|     Adam| 32|         8| 17000|          [32.0,8.0]|
|     John| 31|         7| 16000|          [31.0,7.0]|
|    Chris| 34|        10| 30000|         [34.0,10.0]|
|  Charles| 21|         3| 12000|          [21.0,3.0]|
|     Paul| 24|         5| 15000|          [24.0,5.0]|
|    David| 25|         5| 15000|          [25.0,5.0]|
|  Indiana| 43|        12| 32000|         [43.0,12.0]|
|    Linda| 32|         8| 17000|          [32.0,8.0]|
|Elizabwth| 20|         2| 11000|          [20.0,2.0]|
+---------+---+----------+------+--------------------+
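To make the effect of the assembler concrete, here is a minimal plain-Python sketch that reproduces the `Independent Features` column for three of the rows shown above. It is a conceptual illustration only, not how PySpark implements `VectorAssembler`; the helper name `assemble` is hypothetical.

```python
# Conceptual sketch only: plain Python, not the PySpark implementation.
# Three rows copied from the table above.
rows = [
    {"Name": "Adam", "Age": 32, "Experience": 8, "Salary": 17000},
    {"Name": "Chris", "Age": 34, "Experience": 10, "Salary": 30000},
    {"Name": "Charles", "Age": 21, "Experience": 3, "Salary": 12000},
]


def assemble(row, input_cols, output_col):
    """Return a copy of `row` with the input columns packed into one list of floats."""
    assembled = dict(row)  # keep the original columns, as transform() does
    assembled[output_col] = [float(row[col]) for col in input_cols]
    return assembled


output_rows = [
    assemble(r, ["Age", "Experience"], "Independent Features") for r in rows
]

for r in output_rows:
    print(r["Name"], r["Independent Features"])
```

As in the real output, the integer columns become floating-point entries of the vector, and the original columns are kept alongside the new one.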



In [12]:
output.columns

['Name', 'Age', 'Experience', 'Salary', 'Independent Features']

# **Finalizing The Features For Training Our Model**

In [13]:
finalized_feature = output.select('Independent Features', 'Salary')

In [14]:
finalized_feature.show()

+--------------------+------+
|Independent Features|Salary|
+--------------------+------+
|          [32.0,8.0]| 17000|
|          [31.0,7.0]| 16000|
|         [34.0,10.0]| 30000|
|          [21.0,3.0]| 12000|
|          [24.0,5.0]| 15000|
|          [25.0,5.0]| 15000|
|         [43.0,12.0]| 32000|
|          [32.0,8.0]| 17000|
|          [20.0,2.0]| 11000|
+--------------------+------+



In the output above, **Age and Experience are combined into the Independent Features column**, and the **Salary column is the dependent feature (the label)**.

# **Importing Linear Regression Algorithm**

In [15]:
from pyspark.ml.regression import LinearRegression as lr

# **Splitting The Data Into Training And Testing**

In [16]:
train_data, test_data = finalized_feature.randomSplit([0.75, 0.25])

regressor = lr(featuresCol = 'Independent Features', labelCol = 'Salary', regParam = 0.01)

regressor = regressor.fit(train_data)

24/09/23 08:02:02 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/09/23 08:02:02 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK


* In the **code above**, we **split our data into train_data and test_data**.
* **finalized_feature** contains our **features and label**, and **randomSplit splits the data** into **two parts using the given weights** (0.75 and 0.25). The split is random, so the resulting sizes are only approximately 75% and 25% of the rows.
* In the **regressor**, we specify the **features column and the label column**, that is, the **model inputs and the target**. **regParam = 0.01** adds a small regularization penalty.
* **regressor.fit** then **fits the model to the training data** and returns the fitted model.
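The idea behind a weighted random split can be sketched in plain Python as below. This is a simplified illustration under the assumption that each row is assigned independently by one random draw; it is not Spark's actual implementation, and the function name `random_split` is hypothetical.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row independently to a split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative boundaries in [0, 1], e.g. [0.75, 1.0] for weights [0.75, 0.25].
    boundaries = []
    running = 0.0
    for w in weights:
        running += w / total
        boundaries.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(boundaries):
            # The last split also catches any floating-point leftovers.
            if draw < upper or i == len(boundaries) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(9))  # stand-in for the 9 rows of the dataset
train_rows, test_rows = random_split(rows, [0.75, 0.25], seed=42)
print(len(train_rows), len(test_rows))
```

With only 9 rows, the split sizes can differ noticeably from 75% and 25%, and they change from run to run unless a seed is fixed. PySpark's `randomSplit` also accepts a `seed` argument, which makes the notebook's split reproducible.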

In [17]:
regressor.coefficients

DenseVector([-880.4849, 4418.4044])

In [18]:
regressor.intercept

14425.945748850972
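The coefficients and the intercept fully describe the fitted model: a prediction is the intercept plus each coefficient multiplied by its feature. The sketch below applies that formula by hand, using the values printed above (the coefficients are rounded to four decimals in the printout, so results match the model only approximately). The function name `predict_salary` is hypothetical.

```python
# Values copied from the outputs above; the coefficients are rounded as printed.
coefficients = [-880.4849, 4418.4044]  # order matches inputCols: [Age, Experience]
intercept = 14425.945748850972


def predict_salary(age, experience):
    """Linear model: intercept + coef_age * age + coef_experience * experience."""
    features = [age, experience]
    return intercept + sum(c * x for c, x in zip(coefficients, features))


print(predict_salary(20, 2))
print(predict_salary(32, 8))
```

These two inputs are the rows that ended up in the test set in this run, so the printed values should agree closely with the `prediction` column in the evaluation step.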

# **Evaluating The Model Performance**

In [19]:
pred_result = regressor.evaluate(test_data)

pred_result.predictions.show()

+--------------------+------+------------------+
|Independent Features|Salary|        prediction|
+--------------------+------+------------------+
|          [20.0,2.0]| 11000| 5653.055591073111|
|          [32.0,8.0]| 17000|21597.662420220582|
+--------------------+------+------------------+



# **Getting MAE and MSE**

In [20]:
print("The (MSE) Mean Squared Error is: ", pred_result.meanSquaredError)

The (MSE) Mean Squared Error is:  24864157.121231552


In [21]:
print("The (MAE) Mean Absolute Error is:", pred_result.meanAbsoluteError)

The (MAE) Mean Absolute Error is: 4972.303414573736


In [22]:
import pandas as pd

data = {"Metric": ["MSE", "MAE"], "Value": [pred_result.meanSquaredError, pred_result.meanAbsoluteError]}
df = pd.DataFrame(data)

print(df)

  Metric         Value
0    MSE  2.486416e+07
1    MAE  4.972303e+03
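As a cross-check, both metrics can be recomputed by hand from the two test rows in the predictions table. The sketch below uses plain Python and the actual and predicted salaries copied from that table. The RMSE line is an addition for illustration: it is the square root of the MSE, which brings the error back to salary units.

```python
import math

# Actual and predicted salaries copied from the predictions table above.
actual = [11000, 17000]
predicted = [5653.055591073111, 21597.662420220582]

errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mse = sum(e * e for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                            # same units as Salary

print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", rmse)
```

The MSE is much larger than the MAE because it is measured in squared salary units. With only two test rows, both numbers are a very rough estimate of the model's accuracy.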
