# **[PySpark Tutorial for Beginners: Machine Learning Example](https://www.guru99.com/pyspark-tutorial.html)**

## **What is Apache Spark?**

Spark is a big data solution that has proven to be easier to use and faster than Hadoop MapReduce. It is open source software, originally developed at the UC Berkeley RAD Lab in 2009. Since its public release in 2010, Spark has grown in popularity and is now used throughout the industry at an unprecedented scale.
  
In the era of big data, practitioners need fast and reliable tools to process streams of data more than ever. Earlier tools like MapReduce were popular but slow. To overcome this issue, Spark offers a solution that is both fast and general-purpose. The main difference between Spark and MapReduce is that Spark runs computations in memory, whereas MapReduce runs them on the hard disk. In-memory processing allows high-speed data access and can reduce processing times from hours to minutes.

### What is PySpark?

Spark is the name of the engine that performs cluster computing, while PySpark is the Python library for using Spark.

## **How Does Spark work?**

Spark is built around a computational engine, meaning it takes care of scheduling, distributing and monitoring applications. The work is spread across various worker machines that together form a computing cluster. The cluster divides the work into tasks: one machine performs one task, while the others contribute to the final output through different tasks. In the end, the results of all the tasks are aggregated to produce an output. The Spark admin gives a 360-degree overview of the various Spark jobs.
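
The divide-then-aggregate idea can be illustrated without Spark at all. The sketch below is plain Python and is not how Spark's scheduler is implemented: a thread pool stands in for the worker machines, and the helper names `split_into_chunks` and `partial_sum` are hypothetical, chosen only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, remainder = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks receive one extra element.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_sum(chunk):
    """The task each stand-in worker runs on its own chunk: a sum of squares."""
    return sum(x * x for x in chunk)


data = list(range(1, 11))            # the numbers 1..10
chunks = split_into_chunks(data, 3)  # one chunk per stand-in worker

# Each worker handles one chunk independently of the others.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_sum, chunks))

# The partial results are aggregated into the final output.
total = sum(partials)
print(chunks)
print(partials, total)
```

Spark applies the same pattern, but it also handles shipping the data and code to real machines and recovering from worker failures.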
  
Spark is designed to work with:

- Python
- Java
- Scala
- SQL

A significant feature of Spark is its vast collection of built-in libraries, including MLlib for machine learning. Spark is also designed to work with Hadoop clusters and can read a broad range of file types and data sources, including Hive data, CSV, JSON and Cassandra data, among others.

### Why use Spark?

As a future data practitioner, you should be familiar with Python's famous libraries: Pandas and scikit-learn. These two libraries are fantastic for exploring datasets up to mid-size. Regular machine learning projects are built around the following methodology:

- Load the data to the disk
- Import the data into the machine's memory
- Process/analyze the data
- Build the machine learning model
- Store the prediction back to disk

The problem arises when the data scientist wants to process data that is too big for one computer. In the earlier days of data science, practitioners would sample the data, as training on huge datasets was not always needed. The data scientist would find a good statistical sample, perform an additional robustness check and come up with an excellent model.

However, there are some problems with this:

- Does the dataset reflect the real world?
- Does the data include a specific example?
- Is the model fit for sampling?

Take user recommendations, for instance. Recommenders rely on comparing users with other users to evaluate their preferences. If the data practitioner takes only a subset of the data, there may not be a cohort of users who are very similar to one another. Recommenders need to run on the full dataset or not at all.

### What is the solution?

The solution has been evident for a long time: split the problem up across multiple computers. However, parallel computing comes with problems of its own. Developers often have trouble writing parallel code and end up having to solve a number of complex issues around multiprocessing itself.

PySpark gives the data scientist an API that can be used to solve parallel data processing problems. PySpark handles the complexities of multiprocessing, such as distributing the data, distributing the code and collecting output from the workers on a cluster of machines.

Spark can run standalone but most often runs on top of a cluster computing framework such as Hadoop. In test and development, however, a data scientist can efficiently run Spark on a development box or laptop without a cluster.

- One of the main advantages of Spark is that it lets you build an architecture that encompasses data streaming management, seamless data queries, machine learning prediction and real-time access to various analyses.

- Spark works closely with the SQL language, i.e., with structured data. It allows querying the data in real time.

- A data scientist's main job is to analyze data and build predictive models. In short, a data scientist needs to know how to query data using SQL, produce a statistical report and make use of machine learning to produce predictions. Data scientists spend a significant amount of their time cleaning, transforming and analyzing data. Once the dataset or data workflow is ready, the data scientist uses various techniques to discover insights and hidden patterns. The data manipulation should be robust and, at the same time, easy to use. Spark is the right tool thanks to its speed and rich APIs.

In this tutorial, you will learn how to build a classifier with PySpark.

## **Spark Context**

SparkContext is the internal engine that manages the connection to the cluster. If you want to run an operation, you need a SparkContext.

### Create a SparkContext

First of all, you need to initiate a SparkContext. 

In [1]:
import pyspark

In [3]:
sc = pyspark.SparkContext()

In [4]:
sc

Now that the SparkContext is ready, you can create a collection of data called an RDD (Resilient Distributed Dataset). Computation on an RDD is automatically parallelized across the cluster.

In [5]:
nums = sc.parallelize([1, 2, 3, 4])

You can access the first row with take.

In [7]:
nums.take(1)

[1]

You can apply a transformation to the data with a lambda function. In the example below, you return the square of each element of nums. It is a map transformation.

In [8]:
squared = nums.map(lambda x: x * x).collect()

for num in squared:
    print(num)

1
4
9
16


## **SQLContext**

A more convenient way to work with data is the DataFrame. Since the SparkContext is already set, you can use it to create the DataFrame. You also need to declare the SQLContext.
  
SQLContext connects the engine to different data sources. It is used to initiate the functionalities of Spark SQL.

In [10]:
from pyspark.sql import Row
from pyspark.sql import SQLContext

In [11]:
sqlContext = SQLContext(sc)

Let's create a list of tuples. Each tuple will contain the name of a person and their age. Four steps are required:
  
- Step 1) Create the list of tuples with the information

In [12]:
list_p = [('John',19),('Smith',29),('Adam',35),('Henry',50)]

- Step 2) Build an RDD

In [13]:
rdd = sc.parallelize(list_p)

- Step 3) Convert the tuples 

In [15]:
ppl = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))

- Step 4) Create the DataFrame

In [16]:
DF_ppl = sqlContext.createDataFrame(ppl)

If you want to access the type of each feature, you can use printSchema() 

In [17]:
DF_ppl.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



You can view the DataFrame values using show().

In [23]:
DF_ppl.show()

+---+-----+
|age| name|
+---+-----+
| 19| John|
| 29|Smith|
| 35| Adam|
| 50|Henry|
+---+-----+



## **Machine learning with Spark**

Now that you have a brief idea of Spark and SQLContext, you are ready to build your first machine learning program.

You will proceed as follows:

- Step 1) Basic operation with PySpark
- Step 2) Data preprocessing
- Step 3) Build a data processing pipeline
- Step 4) Build the classifier
- Step 5) Train and evaluate the model
- Step 6) Tune the hyperparameter

In this tutorial, we will use the adult dataset. The purpose of this tutorial is to learn how to use PySpark. For more information about the dataset, refer to this tutorial.

Note that the dataset is not large, so you may find that the computation takes a long time relative to its size. Spark is designed to process considerable amounts of data, and its performance improves relative to other machine learning libraries as the dataset grows larger.

### Step 1) Basic operation with PySpark

First of all, you need to initialize the SQLContext if it has not been initiated yet.

In [24]:
url = "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/adult_data.csv"
sc.addFile(url)

In [25]:
sqlContext = SQLContext(sc)

Then, you can read the CSV file with sqlContext.read.csv. Set inferSchema to True to tell Spark to guess the data types automatically. By default, it is set to False.

In [27]:
df = sqlContext.read.csv(pyspark.SparkFiles.get("adult_data.csv"), header=True, inferSchema=True)

Let's have a look at the data types.

In [28]:
df.printSchema()

root
 |-- x: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- educational-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)



You can see the data with show. 

In [29]:
df.show(5, truncate=False)

+---+---+---------+------+------------+---------------+------------------+-----------------+------------+-----+------+------------+------------+--------------+--------------+------+
|x  |age|workclass|fnlwgt|education   |educational-num|marital-status    |occupation       |relationship|race |gender|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+---+---------+------+------------+---------------+------------------+-----------------+------------+-----+------+------------+------------+--------------+--------------+------+
|1  |25 |Private  |226802|11th        |7              |Never-married     |Machine-op-inspct|Own-child   |Black|Male  |0           |0           |40            |United-States |<=50K |
|2  |38 |Private  |89814 |HS-grad     |9              |Married-civ-spouse|Farming-fishing  |Husband     |White|Male  |0           |0           |50            |United-States |<=50K |
|3  |28 |Local-gov|336951|Assoc-acdm  |12             |Married-civ-spouse|Protective-serv 

#### Select columns

You can select and show specific columns with select and the names of the features. Below, age and fnlwgt are selected.

In [30]:
df.select('age', 'fnlwgt').show(5)

+---+------+
|age|fnlwgt|
+---+------+
| 25|226802|
| 38| 89814|
| 28|336951|
| 44|160323|
| 18|103497|
+---+------+
only showing top 5 rows

