# **arXiv Metadata Analytics with PySpark DataFrames: JSON Case Study**

### Udemy Course: Best Hands-on Big Data Practices and Use Cases using PySpark

### Author: Amin Karami (PhD, FHEA)
#### email: amin.karami@ymail.com

In [None]:
########## ONLY in Colab ##########
# !pip3 install pyspark
########## ONLY in Colab ##########

In [None]:
########## ONLY in Ubuntu Machine ##########
# Load Spark engine
# !pip3 install -q findspark
# import findspark
# findspark.init()
########## ONLY in Ubuntu Machine ##########

In [1]:
# import SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/03 22:39:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
# Read and Load Data to Spark
df = spark.read.json("data/arxiv-metadata-oai-snapshot.json")

df.printSchema()

                                                                                

root
 |-- abstract: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- authors_parsed: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- categories: string (nullable = true)
 |-- comments: string (nullable = true)
 |-- doi: string (nullable = true)
 |-- id: string (nullable = true)
 |-- journal-ref: string (nullable = true)
 |-- license: string (nullable = true)
 |-- report-no: string (nullable = true)
 |-- submitter: string (nullable = true)
 |-- title: string (nullable = true)
 |-- update_date: string (nullable = true)
 |-- versions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- created: string (nullable = true)
 |    |    |-- version: string (nullable = true)



In [3]:
# check the partitions
df.rdd.getNumPartitions()

25

## Question 1: Create a new Schema

In [5]:
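from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# define schema: keep only the six columns used in the questions below
# (third argument True = the column is nullable)
Schema = StructType([
    StructField("authors", StringType(), True),
    StructField("categories", StringType(), True),
    StructField("license", StringType(), True),
    StructField("comments", StringType(), True),
    StructField("abstract", StringType(), True),
    # the inferred schema has an array of structs here; declaring the elements
    # as strings makes each element come back as its raw JSON text
    StructField("versions", ArrayType(StringType()), True),
])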

print(Schema)


StructType([StructField('authors', StringType(), True), StructField('categories', StringType(), True), StructField('license', StringType(), True), StructField('comments', StringType(), True), StructField('abstract', StringType(), True), StructField('versions', ArrayType(StringType(), True), True)])


## Question 2: Binding Data to a Schema

In [6]:
df = spark.read.json("data/arxiv-metadata-oai-snapshot.json", schema=Schema)

df.show()

+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
|             authors|       categories|             license|            comments|            abstract|            versions|
+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
|C. Bal\'azs, E. L...|           hep-ph|                null|37 pages, 15 figu...|  A fully differe...|[{"version":"v1",...|
|Ileana Streinu an...|    math.CO cs.CG|http://arxiv.org/...|To appear in Grap...|  We describe a n...|[{"version":"v1",...|
|         Hongjun Pan|   physics.gen-ph|                null| 23 pages, 3 figures|  The evolution o...|[{"version":"v1",...|
|        David Callan|          math.CO|                null|            11 pages|  We show that a ...|[{"version":"v1",...|
|Wael Abu-Shammala...|  math.CA math.FA|                null|                null|  In this paper w...|[{"version":"v1",...|
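
As a rough illustration of what binding a schema does, the following pure-Python sketch (no Spark required) projects one record onto the six schema fields. It assumes one JSON object per line, which is the default layout `spark.read.json` expects. The helper `project_record` and the sample record are hypothetical and only loosely mirror Spark's behaviour: fields outside the schema are dropped and absent fields become `None`.

```python
import json

# Field names taken from the schema defined in Question 1.
SCHEMA_FIELDS = ["authors", "categories", "license", "comments", "abstract", "versions"]


def project_record(line):
    """Parse one JSON line and keep only the schema fields (hypothetical helper)."""
    record = json.loads(line)
    # dict.get returns None for absent keys, analogous to a null column value.
    return {field: record.get(field) for field in SCHEMA_FIELDS}


# Made-up sample record, for illustration only.
sample = '{"id": "0000.0000", "authors": "A. Author", "categories": "math.CO", "title": "Example"}'
row = project_record(sample)
print(row)
```

The `id` and `title` keys are discarded because they are not part of the schema, while `license`, `comments`, `abstract` and `versions` are filled with `None`.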


## Question 3: Handle missing values in the "comments" and "license" attributes

In [7]:
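# drop rows whose "comments" value is missing
# NOTE: dropna() returns a new DataFrame; the result is not assigned here,
# so df is unchanged (null comments still appear in the output below).
# Use df = df.dropna(subset=["comments"]) to actually remove those rows.
df.dropna(subset=["comments"])

# replace missing "license" values with the placeholder "Unknown"
df = df.fillna(value="Unknown", subset=["license"])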

df.show()

+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
|             authors|       categories|             license|            comments|            abstract|            versions|
+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
|C. Bal\'azs, E. L...|           hep-ph|             Unknown|37 pages, 15 figu...|  A fully differe...|[{"version":"v1",...|
|Ileana Streinu an...|    math.CO cs.CG|http://arxiv.org/...|To appear in Grap...|  We describe a n...|[{"version":"v1",...|
|         Hongjun Pan|   physics.gen-ph|             Unknown| 23 pages, 3 figures|  The evolution o...|[{"version":"v1",...|
|        David Callan|          math.CO|             Unknown|            11 pages|  We show that a ...|[{"version":"v1",...|
|Wael Abu-Shammala...|  math.CA math.FA|             Unknown|                null|  In this paper w...|[{"version":"v1",...|


## Question 4: Get the author names who published a paper in a 'math' category

In [9]:
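# Register DF to be used in SparkSQL
df.createOrReplaceTempView("Archive")

# "categories" is a space-separated string, so LIKE 'math%' keeps only rows
# whose first listed category starts with "math"
sql_query = """
SELECT authors
FROM Archive
WHERE categories LIKE 'math%'
"""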

spark.sql(sql_query).show()
print(spark.sql(sql_query).count())

+--------------------+
|             authors|
+--------------------+
|Ileana Streinu an...|
|        David Callan|
|Wael Abu-Shammala...|
|  Sergei Ovchinnikov|
|Clifton Cunningha...|
|         Dohoon Choi|
|Dohoon Choi and Y...|
|        Koichi Fujii|
|         Norio Konno|
|Simon J.A. Malham...|
|Robert P. C. de M...|
|  P\'eter E. Frenkel|
|          Mihai Popa|
|   Debashish Goswami|
|      Mikkel {\O}bro|
|Nabil L. Youssef,...|
|Wael Abu-Shammala...|
|         Boris Rubin|
|         A. I. Molev|
| Branko J. Malesevic|
+--------------------+
only showing top 20 rows





438941
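
Because `categories` holds several space-separated codes, a prefix match and a per-code match can disagree. The sketch below contrasts the two in plain Python. The function names are hypothetical, and the per-code variant is an assumed alternative reading of the question, not what the query above computes.

```python
def starts_with_math(categories):
    """Mimic SQL `categories LIKE 'math%'`: only the start of the string is tested."""
    return categories.startswith("math")


def any_math_category(categories):
    """Test every space-separated category code (assumed alternative reading)."""
    return any(code.startswith("math") for code in categories.split())


# The first two values appear in the output above; the third is a made-up cross-listing.
samples = ["math.CO cs.CG", "hep-ph", "cs.CG math.CO"]
for value in samples:
    print(value, starts_with_math(value), any_math_category(value))
```

Both variants also accept codes such as `math-ph`, since they test only the `math` prefix.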


                                                                                

## Question 5: Get licenses with 5 or more letters in the "abstract"

In [10]:
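# NOTE: REGEXP takes a regular expression, in which % is a literal percent
# sign (it is a wildcard only in LIKE patterns)
sql_query = """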
SELECT DISTINCT(license)
FROM Archive
WHERE abstract REGEXP '%\(([a-zA-Z][^_ /\\<>]{5,})\)%'
"""

spark.sql(sql_query).show()




+--------------------+
|             license|
+--------------------+
|http://arxiv.org/...|
|http://creativeco...|
|http://creativeco...|
|http://creativeco...|
|http://creativeco...|
|             Unknown|
+--------------------+
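
The pattern in the query can be explored outside Spark with Python's `re` module. The sketch below assumes the intent is to find abstracts containing a parenthesised token that starts with a letter and continues with at least five more characters. That reading is an assumption, since the question wording is ambiguous. It leaves out the surrounding `%` characters, which a regular expression would treat as literal percent signs. The helper name and the sample sentences are made up.

```python
import re

# Assumed intent: "(" + a letter + at least five characters other than
# underscore, space, slash, backslash, "<" or ">" + ")".
PAREN_TOKEN = re.compile(r"\(([a-zA-Z][^_ /\\<>]{5,})\)")


def has_long_parenthesised_token(abstract):
    """Return True if the text contains a parenthesised token of 6+ characters."""
    return PAREN_TOKEN.search(abstract) is not None


# Made-up sentences, for illustration only.
examples = [
    "Data from the telescope (Hubble) are used.",
    "We use quantum chromodynamics (QCD) at high energy.",
]
for text in examples:
    print(has_long_parenthesised_token(text), text)
```

Using a raw string (`r"..."`) keeps the backslashes intact in Python. Inside a Spark SQL string literal, backslashes may be unescaped once more by the SQL parser, so a pattern can need adjusting when moved into a query.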



                                                                                

## Question 6: Compute page-count statistics (average, sum, standard deviation) for papers with an unknown license

In [15]:
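import re

def get_page(line):
    # take the first "<number> pages" occurrence in the comment, if any
    search_found = re.findall(r"\d+ pages", str(line))
    if search_found:
        # keep only the number, e.g. "23 pages" -> "23"
        return search_found[0].split(" ")[0]
    else:
        # no page count found (or the comment is null): count as 0 pages
        return 0

# register as a SQL function; without an explicit returnType the UDF returns
# strings, which the aggregates below cast to numbers
spark.udf.register("PageNumbers", get_page)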

sql_query = """
    SELECT
        AVG(PageNumbers(comments)) AS avg,
        SUM(PageNumbers(comments)) AS sum,
        STD(PageNumbers(comments)) AS std
    FROM Archive
    WHERE license = 'Unknown'
"""

spark.sql(sql_query).show()

23/08/03 23:14:07 WARN SimpleFunctionRegistry: The function pagenumbers replaced a previously registered function.

+------------------+---------+------------------+
|               avg|      sum|               std|
+------------------+---------+------------------+
|12.458538028610604|5642584.0|16.542834618010147|
+------------------+---------+------------------+
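
The extraction step can also be tested on its own, without Spark. The following is a minimal sketch of a variant that always returns an integer. The name `page_count` is hypothetical, and unlike `get_page` it also accepts the singular "1 page", which is an assumption about what should count.

```python
import re

# One or more digits, whitespace, then "page" or "pages".
PAGES = re.compile(r"(\d+)\s+pages?")


def page_count(comment):
    """Return the first page count found in a comment as an int, or 0 if none."""
    if comment is None:
        return 0
    match = PAGES.search(comment)
    return int(match.group(1)) if match else 0


# The first two comments appear in the output of Question 2; the rest are made up.
for comment in ["23 pages, 3 figures", "11 pages", "1 page", None]:
    print(repr(comment), page_count(comment))
```

If such a function were registered as a UDF, an integer return type could be declared through the `returnType` argument of `spark.udf.register`, so the aggregates would not rely on casting strings.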



                                                                                
