### Examples of PySpark ML

In [1]:
# Connecting Google Colab with Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Set the file path of the 'PySpark_test1.csv' file to read
file_path = '/content/drive/MyDrive/Datasets/PySpark_test1.csv'

In [3]:
# Import SparkSession and create a Spark session called spark with the app name 'BaseML'
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('BaseML').getOrCreate()

In [5]:
# Read the 'PySpark_test1.csv' dataset as train_df
train_df = spark.read.csv(file_path, header = True, inferSchema = True)

In [6]:
# Display the contents of train_df
train_df.show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



In [7]:
# Get the schema of the train data set
train_df.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary: integer (nullable = true)



In [8]:
# Get the column names of the train_df pyspark data frame
train_df.columns

['Name', 'age', 'Experience', 'Salary']

#### PySpark's MLlib works differently from scikit-learn
- In scikit-learn, we divide the dataset into **independent and dependent sets** using the variables X and y
- Then we perform a **train-test split** on X and y with an **80-20** or **70-30** ratio to prepare the train and test sets
- We **build the required ML model** and check its **accuracy**, then **tune the hyperparameters** later if required.

With PySpark's MLlib, we instead **group the independent variables** into a single column. In other words, we **assemble** the columns that are treated as independent variables into one vector column.

For this notebook, **'age' and 'Experience'** are the **independent columns** and **'Salary'** is the **dependent** variable. Grouping the **'age'** and **'Experience'** features **creates a new feature** called **'IndF'**, which holds **[31,10]** for **age 31** and **10 years of experience**.

**Task:** Build a simple ML model that can **predict the salary** of an individual from their **age** and **experience** in their field.
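
The grouping described above can be pictured with a few lines of plain Python. This is only a conceptual sketch, not PySpark code: the helper `assemble` and the list-of-dicts layout are hypothetical stand-ins for a data frame, and the rows are copied from the table printed earlier.

```python
# Rows copied from the train_df output shown above (plain dicts, not a Spark data frame)
rows = [
    {"Name": "Krish", "age": 31, "Experience": 10, "Salary": 30000},
    {"Name": "Sudhanshu", "age": 30, "Experience": 8, "Salary": 25000},
    {"Name": "Sunny", "age": 29, "Experience": 4, "Salary": 20000},
    {"Name": "Paul", "age": 24, "Experience": 3, "Salary": 20000},
    {"Name": "Harsha", "age": 21, "Experience": 1, "Salary": 15000},
    {"Name": "Shubham", "age": 23, "Experience": 2, "Salary": 18000},
]

def assemble(row, input_cols, output_col):
    """Hypothetical helper: copy the row and add one list built from input_cols."""
    assembled = dict(row)
    assembled[output_col] = [float(row[col]) for col in input_cols]
    return assembled

# Group 'age' and 'Experience' into a single 'IndF' entry per row
output_rows = [assemble(row, ["age", "Experience"], "IndF") for row in rows]

for row in output_rows:
    print(row["Name"], row["IndF"], row["Salary"])
```

Each row keeps its original columns and gains one extra entry holding the independent values as floats, which is the shape the model expects as its features column.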

In [10]:
# Importing VectorAssembler to assemble the independent columns for the ML model
from pyspark.ml.feature import VectorAssembler
features = VectorAssembler(inputCols=["age","Experience"], outputCol= "IndF")

In [11]:
# Use the transform method to create a new PySpark data frame called output
# It will contain a new feature column called 'IndF'
output = features.transform(train_df)

In [12]:
# Display the contents of the output PySpark data frame
output.show()

+---------+---+----------+------+-----------+
|     Name|age|Experience|Salary|       IndF|
+---------+---+----------+------+-----------+
|    Krish| 31|        10| 30000|[31.0,10.0]|
|Sudhanshu| 30|         8| 25000| [30.0,8.0]|
|    Sunny| 29|         4| 20000| [29.0,4.0]|
|     Paul| 24|         3| 20000| [24.0,3.0]|
|   Harsha| 21|         1| 15000| [21.0,1.0]|
|  Shubham| 23|         2| 18000| [23.0,2.0]|
+---------+---+----------+------+-----------+



In [13]:
# Get the columns of the output pyspark data frame object
output.columns

['Name', 'age', 'Experience', 'Salary', 'IndF']

In [14]:
# Creating the final PySpark data frame for building the ML model
# Store the final data frame as final_df
final_df = output.select('IndF', 'Salary')

In [15]:
# Check what the final data frame for building the ML model looks like
final_df.show()

+-----------+------+
|       IndF|Salary|
+-----------+------+
|[31.0,10.0]| 30000|
| [30.0,8.0]| 25000|
| [29.0,4.0]| 20000|
| [24.0,3.0]| 20000|
| [21.0,1.0]| 15000|
| [23.0,2.0]| 18000|
+-----------+------+



### Future Steps
- Once we have the final data for building the ML model (Linear Regression in this case), we divide the data into two parts: the train part and the test part
- The train part is built from 75% of the available sample and the test part from the remaining 25%.
- We define the regressor variable using the LinearRegression object available in the pyspark.ml.regression module
- Fit the model to the train data
- Get the coefficients (slopes) and the intercept
- Predict results with the test data
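
One detail of the 75-25 split is worth illustrating: a weighted random split assigns each row by an independent random draw, so the resulting sizes only approximate the weights, especially with six rows. The sketch below is plain Python and is an assumption about the general idea, not Spark's actual implementation; `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Hypothetical helper: send each row to a split using one random draw per row."""
    total = sum(weights)
    # Cumulative normalized boundaries, e.g. [0.75, 1.0] for weights [0.75, 0.25]
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            # The last split catches any draw not claimed by an earlier boundary
            if draw < bound or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits

# Feature/label pairs copied from the final_df output shown above
final_rows = [
    ([31.0, 10.0], 30000), ([30.0, 8.0], 25000), ([29.0, 4.0], 20000),
    ([24.0, 3.0], 20000), ([21.0, 1.0], 15000), ([23.0, 2.0], 18000),
]

train_rows, test_rows = random_split(final_rows, [0.75, 0.25], seed=42)
print(len(train_rows), len(test_rows))
```

In PySpark itself, `randomSplit` accepts an optional `seed` argument, for example `final_df.randomSplit([0.75, 0.25], seed=42)`, which makes the split repeatable between runs.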


In [16]:
# Importing the LinearRegression object for building the model
from pyspark.ml.regression import LinearRegression

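# Split the data into train_data and test_data with 75-25 weights
# (the split is random, so the actual sizes only approximate these weights)
train_data, test_data = final_df.randomSplit([0.75, 0.25])
regressor = LinearRegression(featuresCol = 'IndF', labelCol = 'Salary')
# fit() returns the fitted model, which replaces the estimator stored in regressor
regressor = regressor.fit(train_data)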

In [17]:
# Get the coefficients (slopes) of the regressor model using the .coefficients attribute
regressor.coefficients

DenseVector([109.3058, 1199.4092])

In [18]:
# Get the intercept of the regressor model
regressor.intercept

12187.59231905408

In [19]:
# Evaluate the regressor model on test_data; the returned summary, stored in pred, holds the predictions and the error metrics
pred = regressor.evaluate(test_data)

In [20]:
# Get the predicted results and show them as a data frame
pred.predictions.show()

+-----------+------+------------------+
|       IndF|Salary|        prediction|
+-----------+------+------------------+
| [24.0,3.0]| 20000|18409.158050221544|
|[31.0,10.0]| 30000| 27570.16248153613|
+-----------+------+------------------+
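
The predictions in this table can be checked by hand from the coefficients and the intercept printed earlier, since a linear regression prediction is the intercept plus the sum of each coefficient times its feature. The sketch below is plain Python; `predict` is a hypothetical helper, and because the printed coefficients are rounded to four decimals the results match the table only to about two decimal places.

```python
# Values copied from the regressor.coefficients and regressor.intercept output above
coefficients = [109.3058, 1199.4092]  # rounded as printed: [age, Experience]
intercept = 12187.59231905408

def predict(features):
    """Hypothetical helper: intercept + sum(coefficient * feature)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# The two test rows shown in the predictions table
for features in ([24.0, 3.0], [31.0, 10.0]):
    print(features, round(predict(features), 2))
```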



In [21]:
# Get the Mean Absolute Error and Mean Squared Error for the predicted results
print('The MAE value will be:', pred.meanAbsoluteError)
print('The MSE value will be:', pred.meanSquaredError)

The MAE value will be: 2010.339734121164
The MSE value will be: 4217444.237654793
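
Both metrics can be reproduced from the two rows of the predictions table: MAE is the mean of the absolute errors and MSE is the mean of the squared errors. The following is a small plain-Python check using the values printed above; the variable names are chosen here for illustration.

```python
# (actual Salary, prediction) pairs copied from the predictions table above
results = [
    (20000, 18409.158050221544),
    (30000, 27570.16248153613),
]

# Error for each test row: actual value minus predicted value
errors = [actual - predicted for actual, predicted in results]

mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)  # mean squared error

print('MAE:', mae)
print('MSE:', mse)
```

With only two test rows these figures are a rough indication at best; a larger dataset would be needed for a meaningful estimate of the model's error.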
