### 01 Develop Analysis Workflow

This notebook develops and tests the Spark workflow before moving everything into dedicated Python files for running on the cluster.

In [56]:
from pathlib import Path
import pandas as pd

Although the intention here is to practice Spark and distributed computing, let's first look at the data in pandas and use it as a sanity check against the Spark DataFrames.

In [57]:
# Limit how many rows to import for speed
nrows = 100_000

In [58]:
questions = pd.read_csv(Path('./assets/Questions.csv'), nrows=nrows,
                encoding="ISO-8859-1").dropna(subset=["Id", "Body", "CreationDate"])
questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Id            100000 non-null  int64  
 1   OwnerUserId   95342 non-null   float64
 2   CreationDate  100000 non-null  object 
 3   ClosedDate    3331 non-null    object 
 4   Score         100000 non-null  int64  
 5   Title         100000 non-null  object 
 6   Body          100000 non-null  object 
dtypes: float64(1), int64(2), object(4)
memory usage: 5.3+ MB


In [59]:
questions.sort_values(by="Id").head(5)

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
0,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...
1,90,58.0,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...
2,120,83.0,2008-08-01T15:50:08Z,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...
3,180,2089740.0,2008-08-01T18:42:19Z,,53,Function for creating color wheels,<p>This is something I've pseudo-solved many t...
4,260,91.0,2008-08-01T23:22:08Z,,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...


In [60]:
answers = pd.read_csv(Path('./assets/Answers.csv'), nrows=nrows,
                encoding="ISO-8859-1").dropna()
answers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96675 entries, 0 to 99999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            96675 non-null  int64  
 1   OwnerUserId   96675 non-null  float64
 2   CreationDate  96675 non-null  object 
 3   ParentId      96675 non-null  int64  
 4   Score         96675 non-null  int64  
 5   Body          96675 non-null  object 
dtypes: float64(1), int64(3), object(2)
memory usage: 5.2+ MB


In [61]:
answers.head(5)

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body
0,92,61.0,2008-08-01T14:45:37Z,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Vers..."
1,124,26.0,2008-08-01T16:09:47Z,80,12,<p>I wound up using this. It is a kind of a ha...
2,199,50.0,2008-08-01T19:36:46Z,180,1,<p>I've read somewhere the human eye can't dis...
3,269,91.0,2008-08-01T23:49:57Z,260,4,"<p>Yes, I thought about that, but I soon figur..."
4,307,49.0,2008-08-02T01:49:46Z,260,28,"<p><a href=""http://www.codeproject.com/Article..."


In [62]:
tags = pd.read_csv(Path('./assets/Tags.csv'),
                encoding="ISO-8859-1").dropna()
tags.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3749881 entries, 0 to 3750993
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Id      int64 
 1   Tag     object
dtypes: int64(1), object(1)
memory usage: 85.8+ MB


In [63]:
tags.head()

Unnamed: 0,Id,Tag
0,80,flex
1,80,actionscript-3
2,80,air
3,90,svn
4,90,tortoisesvn


### PySpark: Preprocess

Now let's jump into Spark. We'll start by reading in the CSV files. Note that we set a row limit to keep computation light here; when we're ready to run everything on the cluster, the limit will be removed.

In [64]:
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, LongType, StringType
from pyspark.sql import functions as F

spark = (SparkSession.builder
        .master("local[*]")
        .appName("StackOverflow")
        .getOrCreate()
)

In [65]:
df_questions = (spark.read.options(encoding="ISO-8859-1",
                header=True, multiLine=False, mode="DROPMALFORMED")
                .csv("./assets/Questions.csv")
                .limit(10_000)
                )

df_answers = (spark.read.options(encoding="ISO-8859-1", 
                header=True, mode="DROPMALFORMED", multiLine=False)
                .csv('./assets/Answers.csv')
                .limit(10_000)
                )

df_tags = (spark.read.options(encoding="ISO-8859-1", 
            header=True, mode="DROPMALFORMED", multiLine=False)
            .csv('./assets/Tags.csv')
            .limit(10_000)
            )


In [66]:
# Confirm row limits
print(df_questions.count(), df_answers.count(), df_tags.count())

10000 10000 10000


In [67]:
# Inspect Schemas
df_questions.printSchema()
df_answers.printSchema()
df_tags.printSchema()

root
 |-- Id: string (nullable = true)
 |-- OwnerUserId: string (nullable = true)
 |-- CreationDate: string (nullable = true)
 |-- ClosedDate: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Body: string (nullable = true)

root
 |-- Id: string (nullable = true)
 |-- OwnerUserId: string (nullable = true)
 |-- CreationDate: string (nullable = true)
 |-- ParentId: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- Body: string (nullable = true)

root
 |-- Id: string (nullable = true)
 |-- Tag: string (nullable = true)



In [68]:
# Convert datatypes and add column for finding time differences

df_quest_filt = (df_questions
                .withColumn('Id', df_questions['Id'].cast(IntegerType()))
                .withColumn('OwnerUserId', df_questions['OwnerUserId'].cast(IntegerType()))
                .withColumn('CreationDate', F.regexp_replace('CreationDate', 'T', ' '))
                .withColumn('CreationDate', F.regexp_replace('CreationDate', 'Z', ''))
                .withColumn('CreationTime', F.unix_timestamp('CreationDate', 'y-M-d HH:mm:ss').cast(LongType()))
                .withColumn('ClosedDate', F.regexp_replace('ClosedDate', 'T', ' '))
                .withColumn('ClosedDate', F.regexp_replace('ClosedDate', 'Z', ''))
                .withColumn('ClosedTime', F.unix_timestamp('ClosedDate', 'y-M-d HH:mm:ss').cast(LongType()))
                .withColumn('ElapsedTime', (F.col('ClosedTime') - F.col('CreationTime')))
                .withColumn('Score', df_questions['Score'].cast(IntegerType()))
                ).na.drop()

df_answers_filt = (df_answers
                    .withColumn('Id', df_answers['Id'].cast(IntegerType()))
                    .withColumn('OwnerUserId', df_answers['OwnerUserId'].cast(IntegerType()))
                    .withColumn('ParentId', df_answers['ParentId'].cast(IntegerType()))
                    .withColumn('Score', df_answers['Score'].cast(IntegerType()))
                    .withColumn('CreationDate', F.regexp_replace('CreationDate', 'T', ' '))
                    .withColumn('CreationDate', F.regexp_replace('CreationDate', 'Z', ''))
                    .withColumn('CreationTime', F.unix_timestamp('CreationDate', 'y-M-d HH:mm:ss').cast(LongType()))
                    ).na.drop()

df_tags_filt = (df_tags
                .withColumn('Id', df_tags['Id'].cast(IntegerType()))
                ).na.drop()        

In [69]:
# Inspect counts after dropping nulls
print(df_quest_filt.count(), df_answers_filt.count(), df_tags_filt.count())


562 9397 10000


Around 94% of questions are eliminated after dropping nulls (562 of 10,000 remain)! This is most likely due to nulls in the questions' "ClosedDate" column: most questions were never closed, so no closing time can be computed for them. It also indicates that the time difference between a question's CreationDate and ClosedDate will not be a useful feature. Let's try a different approach to removing nulls.

In [70]:
df_quest_filt = (df_questions
                .withColumn('Id', df_questions['Id'].cast(IntegerType()))
                .withColumn('OwnerUserId', df_questions['OwnerUserId'].cast(IntegerType()))
                .withColumn('CreationDate', F.regexp_replace('CreationDate', 'T', ' '))
                .withColumn('CreationDate', F.regexp_replace('CreationDate', 'Z', ''))
                .withColumn('CreationTime', F.unix_timestamp('CreationDate', 'y-M-d HH:mm:ss').cast(LongType()))
                .withColumn('ClosedDate', F.regexp_replace('ClosedDate', 'T', ' '))
                .withColumn('ClosedDate', F.regexp_replace('ClosedDate', 'Z', ''))
                .withColumn('ClosedTime', F.unix_timestamp('ClosedDate', 'y-M-d HH:mm:ss').cast(LongType()))
                .withColumn('Score', df_questions['Score'].cast(IntegerType()))
                ).na.drop(subset=["Id", "Body", "CreationTime"])
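
The difference between the two `na.drop` calls can be illustrated in plain Python. This is only a sketch: `to_epoch` and `drop_nulls` are hypothetical helpers, not Spark APIs, and the three rows reuse the `Id`, `CreationDate`, and `ClosedDate` values shown for questions 80, 90, and 120 above. An unparseable `ClosedDate` such as `NA` becomes `None`, much as Spark yields `null`, and dropping on any null then removes every question that was never closed.

```python
from datetime import datetime, timezone


def to_epoch(value):
    """Parse an ISO-8601 'Z' timestamp to epoch seconds; None if unparseable."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except (TypeError, ValueError):
        return None
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def drop_nulls(records, subset=None):
    """Keep records with no None values in `subset` (or in any column)."""
    kept = []
    for record in records:
        columns = subset if subset is not None else record.keys()
        if all(record[c] is not None for c in columns):
            kept.append(record)
    return kept


rows = [
    {"Id": 80, "CreationDate": "2008-08-01T13:57:07Z", "ClosedDate": "NA"},
    {"Id": 90, "CreationDate": "2008-08-01T14:41:24Z", "ClosedDate": "2012-12-26T03:45:49Z"},
    {"Id": 120, "CreationDate": "2008-08-01T15:50:08Z", "ClosedDate": "NA"},
]
for row in rows:
    row["CreationTime"] = to_epoch(row["CreationDate"])
    row["ClosedTime"] = to_epoch(row["ClosedDate"])  # None for "NA"

drop_any = drop_nulls(rows)
drop_subset = drop_nulls(rows, subset=["Id", "CreationTime"])

print([r["Id"] for r in drop_any])     # [90]
print([r["Id"] for r in drop_subset])  # [80, 90, 120]
```

Restricting the null check to the columns that later steps actually need keeps the open questions in the dataset.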

In [71]:
df_quest_filt.count()

1155
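
Only 1,155 of the 10,000 rows read by Spark survive even this narrower filter. One possible cause, stated as an assumption and not a confirmed diagnosis, is that question bodies contain line breaks inside quoted fields. A reader that treats every physical line as a record (as `multiLine=False` does) would then turn fragments of a body into rows whose `Id` cannot be cast to an integer. The sketch below uses a made-up two-question CSV and the standard `csv` module to contrast naive line splitting with quote-aware parsing.

```python
import csv
import io

# Hypothetical CSV text: the first Body spans two physical lines.
raw = (
    "Id,Score,Body\n"
    '80,26,"<p>First line\nsecond line</p>"\n'
    '90,144,"<p>One line</p>"\n'
)

# Naive approach: every physical line after the header is a "record".
naive_rows = raw.strip().split("\n")[1:]

# Quote-aware approach: a quoted field may contain newlines.
parsed_rows = list(csv.DictReader(io.StringIO(raw)))

print(len(naive_rows))   # 3 (one of them is a fragment of a Body)
print(len(parsed_rows))  # 2
```

If this is what is happening here, reading with `multiLine=True` may recover the missing rows. The stray leading quote in some `Body` values in the Spark output also hints at a quoting mismatch, so the reader's `quote` and `escape` options may be worth checking. Neither has been tried in this notebook.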

In [72]:
df_quest_filt.sort("Id", ascending=True).show(10)



+---+-----------+-------------------+-------------------+-----+--------------------+--------------------+------------+----------+
| Id|OwnerUserId|       CreationDate|         ClosedDate|Score|               Title|                Body|CreationTime|ClosedTime|
+---+-----------+-------------------+-------------------+-----+--------------------+--------------------+------------+----------+
| 80|         26|2008-08-01 13:57:07|                 NA|   26|SQLStatement.exec...|"<p>I've written ...|  1217613427|      null|
| 90|         58|2008-08-01 14:41:24|2012-12-26 03:45:49|  144|Good branching an...|"<p>Are there any...|  1217616084|1356511549|
|120|         83|2008-08-01 15:50:08|                 NA|   21|   ASP.NET Site Maps|<p>Has anyone got...|  1217620208|      null|
|180|    2089740|2008-08-01 18:42:19|                 NA|   53|Function for crea...|<p>This is someth...|  1217630539|      null|
|260|         91|2008-08-01 23:22:08|                 NA|   49|Adding scripting ...|<p>I h


This approach looks much better. Let's start exploring the data. Some potentially interesting questions:
- What are the most common tags?
- Can we predict if a question contains a top tag from the body text?
- Can we predict question score based on text?
- Can we predict how long a question will take to answer?
- What is the relationship between score and how long a question takes to answer?
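
Several of these questions depend on elapsed times, so the epoch columns deserve a quick check. The sketch below uses only the standard library to recompute the epoch seconds for two wall-clock strings as if they were UTC (which the original trailing `Z` denotes) and compares them with the `CreationTime` and `ClosedTime` values Spark printed for questions 80 and 90. `utc_epoch` is a hypothetical helper.

```python
from datetime import datetime, timezone


def utc_epoch(text):
    """Epoch seconds for a 'YYYY-MM-DD HH:MM:SS' string interpreted as UTC."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


# Wall-clock string -> epoch value shown in the Spark output above.
observed = {
    "2008-08-01 13:57:07": 1217613427,  # CreationTime of question 80
    "2012-12-26 03:45:49": 1356511549,  # ClosedTime of question 90
}

offsets = {
    text: spark_value - utc_epoch(text)
    for text, spark_value in observed.items()
}
print(offsets)
# {'2008-08-01 13:57:07': 14400, '2012-12-26 03:45:49': 18000}
```

The Spark values are 4 hours and 5 hours later than their UTC equivalents. That would be consistent with Spark parsing the strings in a local session time zone that observes daylight saving time once the `Z` is stripped, although this is an inference from the numbers and not something verified here. If so, elapsed times spanning a clock change would be off by an hour, and setting `spark.sql.session.timeZone` to `UTC` before parsing is one option worth testing.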

### 1. Most Common Tags
What are the most common tags?

In [73]:
tags_count = (df_tags_filt
            .groupBy('Tag')
            .count()
            .sort('count', ascending=False)
            .limit(10)
            )

In [74]:
tags_count.show()

+----------+-----+
|       Tag|count|
+----------+-----+
|        c#|  399|
|      .net|  362|
|      java|  254|
|   asp.net|  225|
|       c++|  178|
|javascript|  158|
|sql-server|  141|
|       sql|  130|
|    python|  127|
|       php|  124|
+----------+-----+



Let's save these top 10 as a list.

In [75]:
df = tags_count.select('Tag').toPandas()
most_common_tags = df['Tag'].to_list()
print(most_common_tags)

['c#', '.net', 'java', 'asp.net', 'c++', 'javascript', 'sql-server', 'sql', 'python', 'php']
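
In the spirit of the earlier pandas sanity check, the same group, count, sort, and limit logic can be cross-checked on a small sample in plain Python. This is a sketch on hypothetical `(Id, Tag)` pairs: the first five mirror the `tags.head()` output above, and the rest are invented for illustration.

```python
from collections import Counter

# (Id, Tag) pairs; ids 1001-1003 are made up.
tag_rows = [
    (80, "flex"), (80, "actionscript-3"), (80, "air"),
    (90, "svn"), (90, "tortoisesvn"),
    (1001, "c#"), (1002, "c#"), (1002, ".net"),
    (1003, "c#"), (1003, ".net"),
]

# Equivalent of groupBy('Tag').count()
tag_counts = Counter(tag for _, tag in tag_rows)

# Equivalent of sort('count', ascending=False).limit(top_n)
top_n = 2
most_common = [tag for tag, _ in tag_counts.most_common(top_n)]
print(most_common)  # ['c#', '.net']
```

Tags with equal counts can come out in a different order in Spark than in `Counter`, so a comparison of this kind is best made on the set of top tags instead of their exact order.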


Now let's build a model to predict whether or not a question contains a top 10 tag. We can accomplish this by joining dataframes and adding a yes/no label for whether a question contains such a tag.

In [76]:
df_quest_filt.show(3)



+---+-----------+-------------------+-------------------+-----+--------------------+--------------------+------------+----------+
| Id|OwnerUserId|       CreationDate|         ClosedDate|Score|               Title|                Body|CreationTime|ClosedTime|
+---+-----------+-------------------+-------------------+-----+--------------------+--------------------+------------+----------+
| 80|         26|2008-08-01 13:57:07|                 NA|   26|SQLStatement.exec...|"<p>I've written ...|  1217613427|      null|
| 90|         58|2008-08-01 14:41:24|2012-12-26 03:45:49|  144|Good branching an...|"<p>Are there any...|  1217616084|1356511549|
|120|         83|2008-08-01 15:50:08|                 NA|   21|   ASP.NET Site Maps|<p>Has anyone got...|  1217620208|      null|
+---+-----------+-------------------+-------------------+-----+--------------------+--------------------+------------+----------+
only showing top 3 rows




In [77]:
# Groupby id and collect all tags into list
df_id_tags = (df_tags_filt
                .groupBy('Id')
                .agg(F.collect_list('Tag')
                .alias('Tags'))
                .sort('Id', ascending=True)
)

df_id_tags.show(5)

+---+--------------------+
| Id|                Tags|
+---+--------------------+
| 80|[flex, actionscri...|
| 90|[svn, tortoisesvn...|
|120|[sql, asp.net, si...|
|180|[algorithm, langu...|
|260|[c#, .net, script...|
+---+--------------------+
only showing top 5 rows



In [78]:
df_quest_filt.show(4)



+---+-----------+-------------------+-------------------+-----+--------------------+--------------------+------------+----------+
| Id|OwnerUserId|       CreationDate|         ClosedDate|Score|               Title|                Body|CreationTime|ClosedTime|
+---+-----------+-------------------+-------------------+-----+--------------------+--------------------+------------+----------+
| 80|         26|2008-08-01 13:57:07|                 NA|   26|SQLStatement.exec...|"<p>I've written ...|  1217613427|      null|
| 90|         58|2008-08-01 14:41:24|2012-12-26 03:45:49|  144|Good branching an...|"<p>Are there any...|  1217616084|1356511549|
|120|         83|2008-08-01 15:50:08|                 NA|   21|   ASP.NET Site Maps|<p>Has anyone got...|  1217620208|      null|
|180|    2089740|2008-08-01 18:42:19|                 NA|   53|Function for crea...|<p>This is someth...|  1217630539|      null|
+---+-----------+-------------------+-------------------+-----+--------------------+------


In [79]:
# Join dataframes
df_body_tags = (df_quest_filt
                  .join(df_id_tags,
                        df_quest_filt['Id'] == df_id_tags['Id'],
                        'inner')
                  .select(df_quest_filt['Id'], 'Tags', 'Score', 'Body')
                  )

df_body_tags.show(5)

+---+--------------------+-----+--------------------+
| Id|                Tags|Score|                Body|
+---+--------------------+-----+--------------------+
| 80|[flex, actionscri...|   26|"<p>I've written ...|
| 90|[svn, tortoisesvn...|  144|"<p>Are there any...|
|120|[sql, asp.net, si...|   21|<p>Has anyone got...|
|180|[algorithm, langu...|   53|<p>This is someth...|
|260|[c#, .net, script...|   49|<p>I have a littl...|
+---+--------------------+-----+--------------------+
only showing top 5 rows
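
The yes/no label itself is not computed in this notebook yet. A minimal sketch of the intended logic in plain Python follows. `has_top_tag` is a hypothetical helper, the first sample row uses the three tags listed for question 80 in `tags.head()`, and the second row is invented.

```python
# Top 10 tags as printed earlier in the notebook.
most_common_tags = ['c#', '.net', 'java', 'asp.net', 'c++',
                    'javascript', 'sql-server', 'sql', 'python', 'php']


def has_top_tag(tags, top_tags):
    """Return 1 if any of `tags` appears in `top_tags`, else 0."""
    top = set(top_tags)
    return int(any(tag in top for tag in tags))


questions = [
    {"Id": 80, "Tags": ["flex", "actionscript-3", "air"]},
    {"Id": 9001, "Tags": ["c#", "winforms"]},  # hypothetical question
]

labels = {q["Id"]: has_top_tag(q["Tags"], most_common_tags) for q in questions}
print(labels)  # {80: 0, 9001: 1}
```

In Spark, the built-in `F.arrays_overlap` function may be able to express the same check directly on the `Tags` array column without a Python UDF. That is a suggestion to try, not something this notebook has run.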



In [80]:
# Test writing the results to csv
(df_body_tags
    .withColumn('Tags', F.col('Tags').cast('string'))
    .write.option("header", "true")
    .mode("overwrite")
    .csv("output")
)

With the proper dataframes joined, the next step is to parse the body text. We'll use the `udf` function to accomplish this. However, as you'll see, the following steps result in errors associated with `udf` and a failure to "open socket to Python daemon". What's more confusing is that I've previously used the same `udf` workflow successfully on a different project on a different machine.

In [81]:
from pyspark.sql.functions import udf

# barebones udf function for debugging
def parse_body(body):
    return body

parse = udf(parse_body, StringType())

df_test = (df_quest_filt
           .withColumn('body_parsed', parse('Body'))
        )

# Alternatives also attempted:
# parse = udf(lambda x: parse_body(x), StringType())
# 
# @udf(returnType=StringType())
# def parse_body(body):
#     # html = BeautifulSoup(body)
#     return body
# 
# df_test = (df_body_tags
#                   .withColumn('ParsedBody', parse_body(F.col('Body')))
#                   )
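
For reference, the parsing that the UDF is eventually meant to perform (the commented-out `BeautifulSoup(body)` line) can be sketched with the standard library alone. `strip_html` and `_TextExtractor` are hypothetical names, and the sample body is made up.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(body):
    """Return the visible text of an HTML body with whitespace collapsed."""
    if body is None:
        return None
    extractor = _TextExtractor()
    extractor.feed(body)
    extractor.close()
    return " ".join("".join(extractor.parts).split())


print(strip_html("<p>How do I <strong>merge</strong> two branches?</p>"))
# How do I merge two branches?
```

Because the failure below occurs when Spark starts a Python worker, a rough tag-stripping pass with a built-in column function such as `F.regexp_replace('Body', '<[^>]+>', '')` might sidestep the UDF entirely. It is cruder than a real HTML parser and has not been tested here.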

In [82]:
df_test.show(5)

23/03/25 13:54:32 WARN PythonWorkerFactory: Failed to open socket to Python daemon:
java.net.NoRouteToHostException: Can't assign requested address (Address not available)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:476)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:218)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:200)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:394)
	at java.net.Socket.connect(Socket.java:606)
	at java.net.Socket.connect(Socket.java:555)
	at java.net.Socket.<init>(Socket.java:451)
	at java.net.Socket.<init>(Socket.java:261)
	at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:119)
	at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:136)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFa

Py4JJavaError: An error occurred while calling o991.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 120.0 failed 1 times, most recent failure: Lost task 0.0 in stage 120.0 (TID 542) (192.168.0.14 executor driver): java.net.NoRouteToHostException: Can't assign requested address (Address not available)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:476)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:218)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:200)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:394)
	at java.net.Socket.connect(Socket.java:606)
	at java.net.Socket.connect(Socket.java:555)
	at java.net.Socket.<init>(Socket.java:451)
	at java.net.Socket.<init>(Socket.java:261)
	at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:119)
	at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:143)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:135)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:105)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
	at org.apache.spark.sql.execution.python.BatchEvalPythonExec.evaluate(BatchEvalPythonExec.scala:81)
	at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:130)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2303)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2252)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2251)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2251)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1124)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1124)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1124)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2490)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2432)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2421)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:902)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:472)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:425)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3709)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2735)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3700)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3698)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2735)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2942)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:302)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:339)
	at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.net.NoRouteToHostException: Can't assign requested address (Address not available)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:476)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:218)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:200)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:394)
	at java.net.Socket.connect(Socket.java:606)
	at java.net.Socket.connect(Socket.java:555)
	at java.net.Socket.<init>(Socket.java:451)
	at java.net.Socket.<init>(Socket.java:261)
	at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:119)
	at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:143)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:135)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:105)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
	at org.apache.spark.sql.execution.python.BatchEvalPythonExec.evaluate(BatchEvalPythonExec.scala:81)
	at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:130)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more


Internet searches for Spark errors with "PythonWorkerFactory: Failed to open socket to Python daemon" have turned up nothing fruitful (and the results are surprisingly limited). I have also tried running the code with `spark-submit` at the command line (in case Jupyter Notebook was the issue) and tried an earlier version of Spark (3.1.3), but in all cases I still receive the same errors.

Quite stumped on this one. Next step may be to try Google Colab to rule out my machine.
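
One way to narrow this down before switching machines, assuming the `NoRouteToHostException` reflects a loopback networking problem on this machine and not a Spark bug, is to check whether an ordinary Python process can open and connect to a local TCP socket at all. The helper below is a hypothetical diagnostic and does not reproduce Spark's own connection logic.

```python
import socket


def loopback_roundtrip(host="127.0.0.1"):
    """Listen on `host`, connect to the listener, and pass one message through."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((host, 0))  # port 0 lets the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        with socket.create_connection((host, port), timeout=5) as client:
            conn, _ = server.accept()
            with conn:
                client.sendall(b"ping")
                return conn.recv(4)


print(loopback_roundtrip())
```

If this raises an `OSError`, the problem likely sits below Spark (for example VPN, firewall, or hostname resolution settings). If it succeeds, the Spark-side address configuration, such as the `SPARK_LOCAL_IP` environment variable, may be the next thing to inspect.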

**Aside:** Syntax for submitting a Spark job locally is:
```bash
spark-submit main.py --questions=./assets/Questions.csv --answers=./assets/Answers.csv --tags=./assets/Tags.csv
```

In [None]:
spark.stop()