# Course 1: Introduction to PySpark

## 1. Introduction to PySpark

- Spark is a platform for cluster computing.
- Spark lets you spread data and computations over *clusters* with multiple *nodes*.
- As each node works on its own subset of the total data, it also carries out a part of the total calculations required, so that both data processing and computation are performed *in parallel* over the nodes in the cluster.
- However, with greater computing power comes greater complexity. Deciding whether or not Spark is the best solution for your problem takes some experience, but you can consider questions like:
    - Is my data too big to work with on a single machine?
    - Can my calculations be easily parallelized?
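The second question can be made concrete with a toy example. The sketch below uses only the Python standard library, not Spark. It splits a list into chunks (standing in for the subsets of data held by different nodes), computes a partial sum per chunk in a thread pool, and then combines the partial results. The helper name `split_into_chunks` and the chunk count are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into n_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


data = list(range(1, 101))
chunks = split_into_chunks(data, 4)

# Each "worker" sums its own chunk independently of the others
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

# Combining the partial results gives the same answer as a single pass
total = sum(partial_sums)
print(partial_sums, total)
```

A sum parallelizes easily because the partial results can be combined with another sum. A calculation such as an exact median does not decompose this way, which is the kind of thing the question above is asking you to check.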

## 2. Using Spark in Python

First step: Connecting to a cluster.

In practice, the cluster is hosted on a remote machine that is connected to all the other nodes. One computer, called the *master*, manages splitting up the data and the computations. The master is connected to the rest of the computers in the cluster, which are called *workers*. The master sends the workers data and calculations to run, and they send their results back to the master.

Creating the connection is as simple as creating an instance of the `SparkContext` class. The class constructor takes a few optional arguments that allow you to specify the attributes of the cluster you're connecting to.

**Notes:** In `pyspark`, creating a `SparkSession` also creates the underlying `SparkContext`, which you can reach through `spark.sparkContext`, as shown below.

An object holding all these attributes can be created with the `SparkConf()` constructor.

In [1]:
from pyspark.sql import SparkSession

# Create SparkSession from builder
# If the sample data you work with is small, you can remove the `.config` call
spark = SparkSession.builder.appName('Spark').config("spark.driver.memory", "15g").getOrCreate()
spark

23/03/29 11:53:49 WARN Utils: Your hostname, Mufins-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.1.23 instead (on interface en0)
23/03/29 11:53:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/03/29 11:53:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
# Get SparkContext
spark.sparkContext

In [3]:
# Get app name
spark.sparkContext.appName

'Spark'

In [4]:
spark.sparkContext.setLogLevel("OFF")

## 3. Using DataFrames

Spark's core data structure is the Resilient Distributed Dataset (RDD). This is a low level object that lets Spark work its magic by splitting data across multiple nodes in the cluster.

However, RDDs are hard to work with directly. Instead, we can use the Spark DataFrame abstraction, which is built on top of RDDs.

The Spark DataFrame was designed to behave a lot like a SQL table. Not only are DataFrames easier to understand, they are also better optimized for complicated operations than RDDs. With RDDs, operations such as modifying and combining columns and rows of data require you to figure out an efficient way to run the query yourself, whereas the DataFrame API handles that optimization for you.

To start working with Spark DataFrames, you first have to create a `SparkSession` object from your `SparkContext`. You can think of the `SparkContext` as your connection to the cluster and the `SparkSession` as your interface with that connection.

In [5]:
# List all the tables registered in the catalog
spark.catalog.listTables()

[]

In [6]:
df = spark.read.json("user.json")

                                                                                

In [7]:
df.printSchema()

root
 |-- created_at: string (nullable = true)
 |-- description: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- description: struct (nullable = true)
 |    |    |-- cashtags: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- end: long (nullable = true)
 |    |    |    |    |-- start: long (nullable = true)
 |    |    |    |    |-- tag: string (nullable = true)
 |    |    |-- hashtags: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- end: long (nullable = true)
 |    |    |    |    |-- start: long (nullable = true)
 |    |    |    |    |-- tag: string (nullable = true)
 |    |    |-- mentions: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- end: long (nullable = true)
 |    |    |    |    |-- start: long (nullable = true)
 |    |    |    |    |-- username: string (nullable = true)
 |    |

In [9]:
# Register the DataFrame as a temporary view named `user`
df.createOrReplaceTempView("user")

                                                                                

In [10]:
spark.catalog.listTables()

[Table(name='user', database='default', description=None, tableType='MANAGED', isTemporary=False)]

In [11]:
# Select the first 10 rows from table `user`
query = "SELECT * FROM user LIMIT 10"

user10 = spark.sql(query)
user10.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+--------------------+---------+--------------------+--------------------+---------------+--------+--------+
|          created_at|         description|            entities|                  id|            location|                name|    pinned_tweet_id|   profile_image_url|protected|      public_metrics|                 url|       username|verified|withheld|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+--------------------+---------+--------------------+--------------------+---------------+--------+--------+
|2020-01-16 02:02:...|Theoretical Compu...|{{null, null, nul...|u1217628182611927040|       Cambridge, MA|          Boaz Barak|               null|https://pbs.twimg...|    false|{7316, 215, 69, 3...|https://t.co/BoMi...|   boazbaraktcs

In [12]:
# Convert into Pandas DataFrame
user10_df = user10.toPandas()
user10_df.head()

Unnamed: 0,created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld
0,2020-01-16 02:02:55+00:00,Theoretical Computer Scientist. See also https...,"((None, None, None, [Row(display_url='windowso...",u1217628182611927040,"Cambridge, MA",Boaz Barak,,https://pbs.twimg.com/profile_images/125226236...,False,"(7316, 215, 69, 3098)",https://t.co/BoMip9FF17,boazbaraktcs,False,
1,2014-07-02 17:56:46+00:00,creative _,,u2664730894,🎈,olawale 💨,,https://pbs.twimg.com/profile_images/147837638...,False,"(123, 1090, 0, 1823)",,wale_io,False,
2,2020-05-30 12:10:45+00:00,👽,,u1266703520205549568,,panagiota_.b,,https://pbs.twimg.com/profile_images/142608606...,False,"(3, 62, 0, 66)",,b_panagiota,False,
3,2019-01-26 13:52:49+00:00,mama to maya. ABIM research pathway fellow @UV...,"((None, None, [Row(end=50, start=43, username=...",u1089159225148882949,"Charlottesville, VA","Jacqueline Hodges, MD MPH",,https://pbs.twimg.com/profile_images/130229171...,False,"(350, 577, 1, 237)",,jachodges_md,False,
4,2009-04-30 19:01:42+00:00,Father / SWT Alumnus / Longhorn Fan,,u36741729,United States,Matthew Stubblefield,,https://pbs.twimg.com/profile_images/145808462...,True,"(240, 297, 8, 3713)",,Matthew_Brody,False,


In [13]:
# Move data from pandas to Spark
# Import library
import pandas as pd
import numpy as np

# Create pd_temp
pd_temp = pd.DataFrame(np.random.random(10))

# Create spark_temp from pd_temp
spark_temp = spark.createDataFrame(pd_temp)

# Examine the tables in the catalog
print("Before adding: ", spark.catalog.listTables())

# Add spark_temp to the catalog
spark_temp.createOrReplaceTempView("temp")

# Examine the tables in the catalog again
print("After adding: ", spark.catalog.listTables())

Before adding:  [Table(name='user', database='default', description=None, tableType='MANAGED', isTemporary=False)]
After adding:  [Table(name='user', database='default', description=None, tableType='MANAGED', isTemporary=False), Table(name='temp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]


**Notes:** The `.createTempView()` method takes as its only argument the name of the temporary table you'd like to register. This method registers the DataFrame as a table in the catalog, but as this table is temporary, it can only be accessed from the specific `SparkSession` used to create the Spark DataFrame.

If a view with the same name already exists, `.createTempView()` raises an error. The method `.createOrReplaceTempView()` avoids this by replacing the existing view instead.

In [14]:
# Create the DataFrame user
user = spark.table("user")

# Show the head
user.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+--------------------+---------+--------------------+--------------------+---------------+--------+--------+
|          created_at|         description|            entities|                  id|            location|                name|    pinned_tweet_id|   profile_image_url|protected|      public_metrics|                 url|       username|verified|withheld|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+--------------------+---------+--------------------+--------------------+---------------+--------+--------+
|2020-01-16 02:02:...|Theoretical Compu...|{{null, null, nul...|u1217628182611927040|       Cambridge, MA|          Boaz Barak|               null|https://pbs.twimg...|    false|{7316, 215, 69, 3...|https://t.co/BoMi...|   boazbaraktcs

In [15]:
# Add following / follower ratio
user = user.withColumn("reputation", user.public_metrics.following_count / user.public_metrics.followers_count)

In [16]:
# SQL on DataFrame

# Filter users with a SQL string expression (following_count >= 1000)
user_with_id_str = user.filter("public_metrics.following_count >= 1000")
user_with_id_str.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+--------------------+---------+--------------------+--------------------+---------------+--------+--------+--------------------+
|          created_at|         description|            entities|                  id|            location|                name|    pinned_tweet_id|   profile_image_url|protected|      public_metrics|                 url|       username|verified|withheld|          reputation|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+--------------------+---------+--------------------+--------------------+---------------+--------+--------+--------------------+
|2014-07-02 17:56:...|          creative _|                null|         u2664730894|                  🎈|          olawale 💨|               null|https://pbs.twimg...|    fa

In [17]:
# Same filter, expressed as a boolean Column instead of a SQL string
user_with_id_bool = user.filter(user.public_metrics.following_count >= 1000)
user_with_id_bool.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+--------------------+---------+--------------------+--------------------+---------------+--------+--------+--------------------+
|          created_at|         description|            entities|                  id|            location|                name|    pinned_tweet_id|   profile_image_url|protected|      public_metrics|                 url|       username|verified|withheld|          reputation|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+--------------------+---------+--------------------+--------------------+---------------+--------+--------+--------------------+
|2014-07-02 17:56:...|          creative _|                null|         u2664730894|                  🎈|          olawale 💨|               null|https://pbs.twimg...|    fa

In [18]:
# Select columns
user_selected_columns = user.select("reputation", "public_metrics.following_count", "public_metrics.followers_count")
user_selected_columns.show()

+--------------------+---------------+---------------+
|          reputation|following_count|followers_count|
+--------------------+---------------+---------------+
|0.029387643521049753|            215|           7316|
|   8.861788617886178|           1090|            123|
|  20.666666666666668|             62|              3|
|  1.6485714285714286|            577|            350|
|              1.2375|            297|            240|
|               378.0|            378|              1|
|0.001229779982707...|            293|         238254|
|  1.7267080745341614|            278|            161|
| 0.21031168074570347|           2166|          10299|
|  1.0737327188940091|            233|            217|
|                null|            136|              0|
|   3.019607843137255|            154|             51|
|  0.0315131104161655|            393|          12471|
|4.020519664678650...|            241|        5994250|
|1.631001352152734E-4|           1302|        7982826|
|  0.57793

In [19]:
user_selected_columns_colName = user.select(user.reputation, user.public_metrics.following_count, user.public_metrics.followers_count)
user_selected_columns_colName.show()

+--------------------+------------------------------+------------------------------+
|          reputation|public_metrics.following_count|public_metrics.followers_count|
+--------------------+------------------------------+------------------------------+
|0.029387643521049753|                           215|                          7316|
|   8.861788617886178|                          1090|                           123|
|  20.666666666666668|                            62|                             3|
|  1.6485714285714286|                           577|                           350|
|              1.2375|                           297|                           240|
|               378.0|                           378|                             1|
|0.001229779982707...|                           293|                        238254|
|  1.7267080745341614|                           278|                           161|
| 0.21031168074570347|                          2166|            

In [20]:
# Derive a new column from select
user_reputation_select = (user.public_metrics.following_count / user.public_metrics.followers_count).alias("reputation_")
user_reputation_select

Column<'(public_metrics[following_count] / public_metrics[followers_count]) AS reputation_'>

In [21]:
rep_selected = user.select("public_metrics.following_count", "public_metrics.followers_count", user_reputation_select)
rep_selected.show()

+---------------+---------------+--------------------+
|following_count|followers_count|         reputation_|
+---------------+---------------+--------------------+
|            215|           7316|0.029387643521049753|
|           1090|            123|   8.861788617886178|
|             62|              3|  20.666666666666668|
|            577|            350|  1.6485714285714286|
|            297|            240|              1.2375|
|            378|              1|               378.0|
|            293|         238254|0.001229779982707...|
|            278|            161|  1.7267080745341614|
|           2166|          10299| 0.21031168074570347|
|            233|            217|  1.0737327188940091|
|            136|              0|                null|
|            154|             51|   3.019607843137255|
|            393|          12471|  0.0315131104161655|
|            241|        5994250|4.020519664678650...|
|           1302|        7982826|1.631001352152734E-4|
|         

In [22]:
# Use selectExpr
fol_selected_expr = user.selectExpr(
    "public_metrics.following_count",
    "public_metrics.followers_count",
    "public_metrics.following_count + 100 as following_bonus"
)
fol_selected_expr.show()

+---------------+---------------+---------------+
|following_count|followers_count|following_bonus|
+---------------+---------------+---------------+
|            215|           7316|            315|
|           1090|            123|           1190|
|             62|              3|            162|
|            577|            350|            677|
|            297|            240|            397|
|            378|              1|            478|
|            293|         238254|            393|
|            278|            161|            378|
|           2166|          10299|           2266|
|            233|            217|            333|
|            136|              0|            236|
|            154|             51|            254|
|            393|          12471|            493|
|            241|        5994250|            341|
|           1302|        7982826|           1402|
|            990|           1713|           1090|
|           1206|          45541|           1306|


In [23]:
# Aggregation
user.groupBy().min("public_metrics.following_count").show()

+------------------------------------------------------+
|min(public_metrics.following_count AS following_count)|
+------------------------------------------------------+
|                                                     0|
+------------------------------------------------------+



In [24]:
user.groupBy().max("public_metrics.following_count").show()

+------------------------------------------------------+
|max(public_metrics.following_count AS following_count)|
+------------------------------------------------------+
|                                               4161031|
+------------------------------------------------------+



In [25]:
user.groupBy().avg("public_metrics.following_count").show()

+------------------------------------------------------+
|avg(public_metrics.following_count AS following_count)|
+------------------------------------------------------+
|                                           2250.534189|
+------------------------------------------------------+



In [26]:
# Get number of protected accounts
user.groupBy("protected").count().show()

+---------+------+
|protected| count|
+---------+------+
|     true| 30336|
|    false|969664|
+---------+------+



In [27]:
# Grouping and aggregating
user.groupBy("protected", "verified").avg("public_metrics.following_count").show()

+---------+--------+------------------------------------------------------+
|protected|verified|avg(public_metrics.following_count AS following_count)|
+---------+--------+------------------------------------------------------+
|     true|   false|                                    1191.1624875290988|
|     true|    true|                                    2773.8308270676694|
|    false|   false|                                     2102.914040881294|
|    false|    true|                                    3940.9711979144768|
+---------+--------+------------------------------------------------------+



In [28]:
# External functions
import pyspark.sql.functions as F

# Grouping and aggregating
user.groupBy("protected", "verified").agg(F.stddev("public_metrics.following_count")).show()

+---------+--------+-------------------------------------------+
|protected|verified|stddev_samp(public_metrics.following_count)|
+---------+--------+-------------------------------------------+
|     true|   false|                           4907.96237482749|
|     true|    true|                          5632.911627343465|
|    false|   false|                         11902.516160950481|
|    false|    true|                          36122.61662363998|
+---------+--------+-------------------------------------------+
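The output column is named `stddev_samp` because `F.stddev` computes the sample standard deviation, which divides by n - 1 rather than n (`F.stddev_pop` is the population version). The difference can be checked on a tiny list with the standard library's `statistics` module. The numbers below are made up for illustration.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up numbers with mean 5

# Sample standard deviation: sum of squared deviations divided by n - 1
sample_sd = statistics.stdev(values)

# Population standard deviation: sum of squared deviations divided by n
population_sd = statistics.pstdev(values)

print(sample_sd, population_sd)
```

On groups with hundreds of thousands of rows, like the ones above, the two versions are nearly identical; the choice matters mostly for small groups.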



In [29]:
# Generate label table
label = spark.read.csv("label.csv", header=True)
label.printSchema()

root
 |-- id: string (nullable = true)
 |-- label: string (nullable = true)



In [30]:
# Joining
user_join_label = user.join(label, on="id", how="leftouter")
# Note that the schema now includes the "label" column
user_join_label.printSchema()

root
 |-- id: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- description: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- description: struct (nullable = true)
 |    |    |-- cashtags: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- end: long (nullable = true)
 |    |    |    |    |-- start: long (nullable = true)
 |    |    |    |    |-- tag: string (nullable = true)
 |    |    |-- hashtags: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- end: long (nullable = true)
 |    |    |    |    |-- start: long (nullable = true)
 |    |    |    |    |-- tag: string (nullable = true)
 |    |    |-- mentions: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- end: long (nullable = true)
 |    |    |    |    |-- start: long (nullable = true)
 |    |    |    |    |-- username

## 4. Getting started with Machine Learning pipelines

In [31]:
# Flatten the DataFrame, since Pipeline does not work with nested structures
# The wildcard (*) selects every field of the nested `public_metrics` struct
user_refactored = user_join_label.select("public_metrics.*", "location", "label")
user_refactored.printSchema()

root
 |-- followers_count: long (nullable = true)
 |-- following_count: long (nullable = true)
 |-- listed_count: long (nullable = true)
 |-- tweet_count: long (nullable = true)
 |-- location: string (nullable = true)
 |-- label: string (nullable = true)



In [32]:
# Rename the original label column so that the numeric indicator can use the name `label`
user_refactored = user_refactored.withColumnRenamed("label", "label_name")
user_refactored = user_refactored.withColumn("label_indicator", user_refactored.label_name == 'bot')
user_refactored = user_refactored.withColumn("label", user_refactored.label_indicator.cast("integer"))
user_refactored.printSchema()

root
 |-- followers_count: long (nullable = true)
 |-- following_count: long (nullable = true)
 |-- listed_count: long (nullable = true)
 |-- tweet_count: long (nullable = true)
 |-- location: string (nullable = true)
 |-- label_name: string (nullable = true)
 |-- label_indicator: boolean (nullable = true)
 |-- label: integer (nullable = true)



In [33]:
user_refactored.show()



+---------------+---------------+------------+-----------+--------------------+----------+---------------+-----+
|followers_count|following_count|listed_count|tweet_count|            location|label_name|label_indicator|label|
+---------------+---------------+------------+-----------+--------------------+----------+---------------+-----+
|           6785|           6390|           4|       5022|Pachuca de Soto, ...|       bot|           true|    1|
|            439|            569|           1|        183|  West Lafayette, IN|     human|          false|    0|
|           1312|           1288|           2|      42392|                null|     human|          false|    0|
|             87|             42|          53|       2625|    India, Bangalore|     human|          false|    0|
|             30|            204|           0|        401|       Iowa City, IA|     human|          false|    0|
|          15324|           1201|          51|      17298|         God's Earth|     human|      

                                                                                

In [34]:
# Drop every row that contains a null in any column
user_refactored = user_refactored.na.drop("any")
user_refactored.show()



+---------------+---------------+------------+-----------+--------------------+----------+---------------+-----+
|followers_count|following_count|listed_count|tweet_count|            location|label_name|label_indicator|label|
+---------------+---------------+------------+-----------+--------------------+----------+---------------+-----+
|           6785|           6390|           4|       5022|Pachuca de Soto, ...|       bot|           true|    1|
|            439|            569|           1|        183|  West Lafayette, IN|     human|          false|    0|
|             87|             42|          53|       2625|    India, Bangalore|     human|          false|    0|
|             30|            204|           0|        401|       Iowa City, IA|     human|          false|    0|
|          15324|           1201|          51|      17298|         God's Earth|     human|          false|    0|
|          30610|           2008|          43|      54595|              España|     human|      

                                                                                

In [35]:
# One-hot encode `location`, then assemble the feature vector
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

loc_indexer = StringIndexer(inputCol="location", outputCol="location_index")
loc_encoder = OneHotEncoder(inputCol="location_index", outputCol="location_fact")

vec_assembler = VectorAssembler(
    inputCols=[
        "following_count",
        "followers_count",
        "location_fact",
    ],
    outputCol="features"
)
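To see what the first two stages do conceptually, here is a plain-Python sketch, not Spark's implementation. It assumes the default settings: `StringIndexer` orders labels by descending frequency (ties broken alphabetically in recent Spark versions), and `OneHotEncoder` drops the last category, so the least frequent label is encoded as an all-zero vector. The location values are made up.

```python
from collections import Counter

locations = ["Paris", "Lagos", "Paris", "Tokyo", "Lagos", "Paris"]  # made-up data

# Step 1 (indexing): the most frequent label gets index 0, the next one index 1, ...
counts = Counter(locations)
ordered = sorted(counts, key=lambda label: (-counts[label], label))
index_of = {label: i for i, label in enumerate(ordered)}


def one_hot_drop_last(index, n_categories):
    """Encode an index as a vector of length n_categories - 1 (last category dropped)."""
    vec = [0.0] * (n_categories - 1)
    if index < n_categories - 1:
        vec[index] = 1.0
    return vec


# Step 2 (encoding): turn every index into a 0/1 vector
encoded = [one_hot_drop_last(index_of[loc], len(ordered)) for loc in locations]

print(index_of)
print(encoded)
```

Spark stores the encoded column as sparse vectors rather than Python lists, which matters for a column like `location` that can have a very large number of distinct values.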

In [36]:
# Define a pipeline
from pyspark.ml import Pipeline

sobog_pipeline = Pipeline(stages=[
    loc_indexer,
    loc_encoder,
    vec_assembler
])

In [37]:
piped_data = sobog_pipeline.fit(user_refactored).transform(user_refactored)
piped_data

                                                                                

DataFrame[followers_count: bigint, following_count: bigint, listed_count: bigint, tweet_count: bigint, location: string, label_name: string, label_indicator: boolean, label: int, location_index: double, location_fact: vector, features: vector]

In [38]:
# Train test split
training, test = piped_data.randomSplit([0.6, 0.4])
training, test

(DataFrame[followers_count: bigint, following_count: bigint, listed_count: bigint, tweet_count: bigint, location: string, label_name: string, label_indicator: boolean, label: int, location_index: double, location_fact: vector, features: vector],
 DataFrame[followers_count: bigint, following_count: bigint, listed_count: bigint, tweet_count: bigint, location: string, label_name: string, label_indicator: boolean, label: int, location_index: double, location_fact: vector, features: vector])
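`randomSplit` assigns rows at random, so the two parts are only approximately 60% and 40% of the data, and the split changes between runs unless a seed is passed (for example `randomSplit([0.6, 0.4], seed=42)`). The plain-Python sketch below illustrates the idea of weight-based random assignment. It is a conceptual model rather than Spark's actual code, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split, with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative upper bounds of each split on the interval [0, 1)
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.6, 0.4], seed=42)
print(len(train_rows), len(test_rows))
```

With a fixed seed the same rows land in the same split every time, which makes later evaluation numbers comparable across runs.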

In [39]:
# Import LogisticRegression
from pyspark.ml.classification import LogisticRegression

# Create a LogisticRegression Estimator
lr = LogisticRegression()

In [40]:
# Import the evaluation submodule
import pyspark.ml.evaluation as evals

# Create a BinaryClassificationEvaluator
evaluator = evals.BinaryClassificationEvaluator(metricName="areaUnderROC")
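The `areaUnderROC` metric can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that quantity directly by comparing every positive/negative pair. It is meant to build intuition, is not how Spark computes the metric internally, and uses made-up labels and scores.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0, 1]            # 1 = bot, 0 = human (made-up)
scores = [0.9, 0.8, 0.7, 0.3, 0.6]  # made-up predicted probabilities of "bot"

auc = area_under_roc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the metric does not depend on picking a classification threshold.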

In [44]:
# Make a grid (hyperparameter tuning)
# Import the tuning submodule
import pyspark.ml.tuning as tune

# Create the parameter grid
grid = tune.ParamGridBuilder()

# Add the hyperparameter
grid = grid.addGrid(lr.elasticNetParam, [0, 1])

# Build the grid
grid = grid.build()
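A parameter grid is the Cartesian product of the value lists added to it, and cross-validation fits one model per grid entry per fold. The plain-Python sketch below counts those fits. It assumes a grid that also varies `regParam`, which the cell above does not do: `elasticNetParam` only mixes the L1 and L2 penalties (0 is pure L2, 1 is pure L1), and `LogisticRegression` defaults to `regParam=0.0`, so without a non-zero `regParam` both grid entries describe the same unregularized model. The candidate values are arbitrary.

```python
from itertools import product

# Hypothetical candidate values for two hyperparameters
param_values = {
    "regParam": [0.0, 0.01, 0.1],
    "elasticNetParam": [0.0, 1.0],
}

# Every combination of values becomes one entry in the grid
names = list(param_values)
grid_sketch = [dict(zip(names, combo)) for combo in product(*param_values.values())]

num_folds = 3  # CrossValidator's default number of folds
fits_during_cv = len(grid_sketch) * num_folds

print(len(grid_sketch), fits_during_cv)
```

After the cross-validation fits, `CrossValidator` refits the best parameter combination once more on the full training set, so the cost grows quickly as values are added to the grid.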

In [45]:
# Create the CrossValidator
cv = tune.CrossValidator(estimator=lr,
               estimatorParamMaps=grid,
               evaluator=evaluator)

In [46]:
# Fit cross validation models
models = cv.fit(training)

# Extract the best model
best_lr = models.bestModel

ConnectionRefusedError: [Errno 61] Connection refused

In [None]:
# Use the model to predict the test set
test_results = best_lr.transform(test)

# Evaluate the predictions
print(evaluator.evaluate(test_results))

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/Users/mufin/opt/anaconda3/lib/python3.9/site-packages/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mufin/opt/anaconda3/lib/python3.9/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/Users/mufin/opt/anaconda3/lib/python3.9/site-packages/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
