# **arXiv Metadata Analytics with PySpark DataFrames: JSON Case Study**

### Udemy Course: Best Hands-on Big Data Practices and Use Cases using PySpark

### Author: Amin Karami (PhD, FHEA)

In \[ \]:

    ########## ONLY in Colab ##########
    !pip3 install pyspark
    ########## ONLY in Colab ##########

In \[1\]:

    ########## ONLY in Ubuntu Machine ##########
    # Load Spark engine
    !pip3 install -q findspark
    import findspark
    findspark.init()
    ########## ONLY in Ubuntu Machine ##########

In \[2\]:

    # import SparkSession

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").getOrCreate()

    spark

Out\[2\]:

**SparkSession - in-memory**

**SparkContext**

[Spark UI](http://192.168.229.131:4040)

Version  
`v3.0.0`

Master  
`local[*]`

AppName  
`pyspark-shell`

In \[3\]:

    # Read and Load Data to Spark
    # Data source: https://www.kaggle.com/Cornell-University/arxiv/version/62

    df = spark.read.json("data/arxiv-metadata-oai-snapshot.json")
    df.printSchema()

    root
     |-- abstract: string (nullable = true)
     |-- authors: string (nullable = true)
     |-- authors_parsed: array (nullable = true)
     |    |-- element: array (containsNull = true)
     |    |    |-- element: string (containsNull = true)
     |-- categories: string (nullable = true)
     |-- comments: string (nullable = true)
     |-- doi: string (nullable = true)
     |-- id: string (nullable = true)
     |-- journal-ref: string (nullable = true)
     |-- license: string (nullable = true)
     |-- report-no: string (nullable = true)
     |-- submitter: string (nullable = true)
     |-- title: string (nullable = true)
     |-- update_date: string (nullable = true)
     |-- versions: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- created: string (nullable = true)
     |    |    |-- version: string (nullable = true)

In \[5\]:

    # check the partitions
    print(df.rdd.getNumPartitions())

    25

## Question 1: Create a new Schema

In \[6\]:

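    from pyspark.sql.types import ArrayType, StringType, StructField, StructType

    # Define a reduced schema that keeps only the six fields used in the questions below.
    # 'versions' is declared as an array of strings, so each version record
    # is read as its raw JSON text.
    Schema = StructType([
        StructField('authors', StringType(), True),
        StructField('categories', StringType(), True),
        StructField('license', StringType(), True),
        StructField('comments', StringType(), True),
        StructField('abstract', StringType(), True),
        StructField('versions', ArrayType(StringType()), True),
    ])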

    print(Schema)

    StructType(List(StructField(authors,StringType,true),StructField(categories,StringType,true),StructField(license,StringType,true),StructField(comments,StringType,true),StructField(abstract,StringType,true),StructField(versions,ArrayType(StringType,true),true)))

## Question 2: Binding Data to a Schema

In \[7\]:

    df = spark.read.json("data/arxiv-metadata-oai-snapshot.json", schema=Schema)

    df.show()

    +--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
    |             authors|       categories|             license|            comments|            abstract|            versions|
    +--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
    |C. Bal\'azs, E. L...|           hep-ph|                null|37 pages, 15 figu...|  A fully differe...|[{"version":"v1",...|
    |Ileana Streinu an...|    math.CO cs.CG|http://arxiv.org/...|To appear in Grap...|  We describe a n...|[{"version":"v1",...|
    |         Hongjun Pan|   physics.gen-ph|                null| 23 pages, 3 figures|  The evolution o...|[{"version":"v1",...|
    |        David Callan|          math.CO|                null|            11 pages|  We show that a ...|[{"version":"v1",...|
    |Wael Abu-Shammala...|  math.CA math.FA|                null|                null|  In this paper w...|[{"version":"v1",...|
    |Y. H. Pong and C....|cond-mat.mes-hall|                null|6 pages, 4 figure...|  We study the tw...|[{"version":"v1",...|
    |Alejandro Corichi...|            gr-qc|                null|16 pages, no figu...|  A rather non-st...|[{"version":"v1",...|
    |     Damian C. Swift|cond-mat.mtrl-sci|http://arxiv.org/...|   Minor corrections|  A general formu...|[{"version":"v1",...|
    |Paul Harvey, Brun...|         astro-ph|                null|                null|  We discuss the ...|[{"version":"v1",...|
    |  Sergei Ovchinnikov|          math.CO|                null|36 pages, 17 figures|  Partial cubes a...|[{"version":"v1",...|
    |Clifton Cunningha...|  math.NT math.AG|http://arxiv.org/...|14 pages; title c...|  In this paper w...|[{"version":"v1",...|
    |         Dohoon Choi|          math.NT|                null|                null|  Recently, Bruin...|[{"version":"v1",...|
    |Dohoon Choi and Y...|          math.NT|                null|                null|  Serre obtained ...|[{"version":"v1",...|
    |        Koichi Fujii|  math.CA math.AT|                null|  18 pages, 1 figure|  In this article...|[{"version":"v1",...|
    |     Christian Stahn|           hep-th|                null|22 pages; signs a...|  The pure spinor...|[{"version":"v1",...|
    |Chao-Hsi Chang, T...|           hep-ph|                null|17 pages, 3 figur...|  In this work, w...|[{"version":"v1",...|
    |Nceba Mhlahlo, Da...|         astro-ph|                null|10 pages, 11 figu...|  Results from sp...|[{"version":"v1",...|
    |  Andreas Gustavsson|           hep-th|                null|20 pages, v2: an ...|  We give a presc...|[{"version":"v1",...|
    |         Norio Konno|  math.PR math.AG|                null|6 pages, Journal-...|  In this note we...|[{"version":"v1",...|
    |The BABAR Collabo...|           hep-ex|                null|21 pages, 13 post...|  The shape of th...|[{"version":"v1",...|
    +--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
    only showing top 20 rows
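
Because `versions` is declared as an array of strings, each element arrives as the raw JSON text of one version record, as the `[{"version":"v1",...` prefix in the output suggests. A minimal sketch of decoding such elements in plain Python follows. The sample strings, including their `created` values, are made up for illustration, and `parse_version` is a hypothetical helper that is not part of the notebook.

```python
import json

# Hypothetical sample of one 'versions' value; the contents are illustrative only.
sample_versions = [
    '{"version":"v1","created":"Mon, 2 Apr 2007 19:18:42 GMT"}',
    '{"version":"v2","created":"Tue, 24 Jul 2007 20:10:27 GMT"}',
]

def parse_version(raw):
    """Decode one JSON-encoded version record into a dict."""
    return json.loads(raw)

parsed = [parse_version(v) for v in sample_versions]
latest = parsed[-1]["version"]  # assumes records are listed oldest first
print(len(parsed), latest)
```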

## Question 3: Missing values for the comments and license attributes

In \[18\]:

    # drop rows whose 'comments' value is missing
    df = df.dropna(subset = ["comments"])

    # replace missing 'license' values with the placeholder "unknown"
    df = df.fillna(value = "unknown", subset = ["license"])

    df.show()

    +--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
    |             authors|          categories|             license|            comments|            abstract|            versions|
    +--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
    |C. Bal\'azs, E. L...|              hep-ph|             unknown|37 pages, 15 figu...|  A fully differe...|[{"version":"v1",...|
    |Ileana Streinu an...|       math.CO cs.CG|http://arxiv.org/...|To appear in Grap...|  We describe a n...|[{"version":"v1",...|
    |         Hongjun Pan|      physics.gen-ph|             unknown| 23 pages, 3 figures|  The evolution o...|[{"version":"v1",...|
    |        David Callan|             math.CO|             unknown|            11 pages|  We show that a ...|[{"version":"v1",...|
    |Y. H. Pong and C....|   cond-mat.mes-hall|             unknown|6 pages, 4 figure...|  We study the tw...|[{"version":"v1",...|
    |Alejandro Corichi...|               gr-qc|             unknown|16 pages, no figu...|  A rather non-st...|[{"version":"v1",...|
    |     Damian C. Swift|   cond-mat.mtrl-sci|http://arxiv.org/...|   Minor corrections|  A general formu...|[{"version":"v1",...|
    |  Sergei Ovchinnikov|             math.CO|             unknown|36 pages, 17 figures|  Partial cubes a...|[{"version":"v1",...|
    |Clifton Cunningha...|     math.NT math.AG|http://arxiv.org/...|14 pages; title c...|  In this paper w...|[{"version":"v1",...|
    |        Koichi Fujii|     math.CA math.AT|             unknown|  18 pages, 1 figure|  In this article...|[{"version":"v1",...|
    |     Christian Stahn|              hep-th|             unknown|22 pages; signs a...|  The pure spinor...|[{"version":"v1",...|
    |Chao-Hsi Chang, T...|              hep-ph|             unknown|17 pages, 3 figur...|  In this work, w...|[{"version":"v1",...|
    |Nceba Mhlahlo, Da...|            astro-ph|             unknown|10 pages, 11 figu...|  Results from sp...|[{"version":"v1",...|
    |  Andreas Gustavsson|              hep-th|             unknown|20 pages, v2: an ...|  We give a presc...|[{"version":"v1",...|
    |         Norio Konno|     math.PR math.AG|             unknown|6 pages, Journal-...|  In this note we...|[{"version":"v1",...|
    |The BABAR Collabo...|              hep-ex|             unknown|21 pages, 13 post...|  The shape of th...|[{"version":"v1",...|
    |Vanessa Casagrand...|nlin.PS physics.c...|             unknown|  5 pages, 4 figures|  Spatiotemporal ...|[{"version":"v1",...|
    |Simon J.A. Malham...|             math.NA|             unknown| 20 pages, 4 figures|  We present Lie ...|[{"version":"v1",...|
    |M. A. Loukitcheva...|            astro-ph|             unknown|4 pages, 2 figure...|  The very nature...|[{"version":"v1",...|
    |A.A. Serga, M. Ko...|             nlin.PS|             unknown|First appeared in...|  The formation o...|[{"version":"v1",...|
    +--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
    only showing top 20 rows

## Question 4: Get the names of authors who published a paper in a 'math' category

In \[34\]:

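    # register the DataFrame as a temporary view so it can be queried with SQL
    df.createOrReplaceTempView("Archive")

    # LIKE 'math%' keeps rows whose category string starts with 'math'
    sql_query = """ SELECT authors FROM Archive
                    WHERE categories LIKE 'math%'
                """

    # build the result once and reuse it for both actions
    math_authors = spark.sql(sql_query)
    math_authors.show()

    print(math_authors.count())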

    +--------------------+
    |             authors|
    +--------------------+
    |Ileana Streinu an...|
    |        David Callan|
    |  Sergei Ovchinnikov|
    |Clifton Cunningha...|
    |        Koichi Fujii|
    |         Norio Konno|
    |Simon J.A. Malham...|
    |Robert P. C. de M...|
    |  P\'eter E. Frenkel|
    |          Mihai Popa|
    |   Debashish Goswami|
    |      Mikkel {\O}bro|
    |Nabil L. Youssef,...|
    |         Boris Rubin|
    |         A. I. Molev|
    | Branko J. Malesevic|
    |   John W. Robertson|
    |     Yu.N. Kosovtsov|
    |        Osamu Fujino|
    |Stephen C. Power ...|
    +--------------------+
    only showing top 20 rows

    304590
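
The `LIKE 'math%'` filter tests the beginning of the whole `categories` string. A paper can list several space-separated categories (for example `math.CO cs.CG` in the output of Question 2), so a paper whose `math` category is not listed first would be skipped, while a string starting with `math-ph` would be kept. The plain-Python sketch below contrasts the two behaviours. `starts_with_math` and `has_math_category` are hypothetical helpers, the reordered sample `cs.CG math.CO` is invented, and treating only `math.` entries as 'math' is an assumption about the intended meaning.

```python
def starts_with_math(categories):
    """Mimic SQL `categories LIKE 'math%'`: a prefix test on the whole string."""
    return categories.startswith("math")

def has_math_category(categories):
    """Check every space-separated category for the 'math.' prefix."""
    return any(token.startswith("math.") for token in categories.split())

samples = ["math.CO cs.CG", "cs.CG math.CO", "math-ph", "hep-ph"]
for s in samples:
    print(f"{s!r}: prefix={starts_with_math(s)}, token={has_math_category(s)}")
```

A comparable Spark SQL filter might be `categories LIKE 'math.%' OR categories LIKE '% math.%'`. This is a suggestion only, and it would change the count reported above.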

## Question 5: Get licenses with 5 or more letters in the abstract

In \[27\]:

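    # REGEXP uses regular-expression syntax: the '%' signs in the pattern are
    # matched as literal percent characters, not as LIKE-style wildcards.
    sql_query = """ SELECT distinct(license) FROM Archive
                    WHERE abstract REGEXP '%\(([A-Za-z][^_ /\\<>]{5,})\)%'
                """
    spark.sql(sql_query).show()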

    +--------------------+
    |             license|
    +--------------------+
    |http://creativeco...|
    |http://creativeco...|
    |http://arxiv.org/...|
    |             unknown|
    |http://creativeco...|
    +--------------------+
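
In a regular expression the `%` character has no wildcard meaning; unlike in `LIKE`, it matches a literal percent sign. The sketch below applies the pattern text from the query directly in Python's `re` module, with and without the surrounding `%` signs. That is an assumption made for illustration: Spark uses Java regular expressions and processes the SQL string literal before the pattern reaches the engine, so this only shows the role of `%`. The sample abstracts are invented.

```python
import re

# Pattern text from the query, with and without the surrounding '%' signs.
with_percent = re.compile(r'%\(([A-Za-z][^_ /\\<>]{5,})\)%')
without_percent = re.compile(r'\(([A-Za-z][^_ /\\<>]{5,})\)')

# Invented sample abstracts, for illustration only.
abstracts = [
    "We use the Large Hadron Collider (LHCdata) in this study.",
    "A short note (abc) without a long parenthesised term.",
]

for text in abstracts:
    # First flag: pattern with '%'; second flag: pattern without '%'.
    print(bool(with_percent.search(text)), bool(without_percent.search(text)))
```

Under these assumptions, dropping the two `%` signs is what lets the pattern match a parenthesised term anywhere in the text.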

## Question 6: Extract statistics on the number of pages for papers with an unknown license

In \[48\]:

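    import re

    def get_Page(line):
        # return the first "<number> pages" count in the comment, or 0 if none is found
        search = re.findall(r'\d+ pages', line)
        if search:
            return int(search[0].split(" ")[0])
        else:
            return 0

    # register the Python function as a SQL UDF named "PageNumbers"
    spark.udf.register("PageNumbers", get_Page)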

    sql_query = """SELECT AVG(PageNumbers(comments)) AS avg, SUM(PageNumbers(comments)) AS sum,
                    STD(PageNumbers(comments)) AS std
                    FROM Archive
                    WHERE license="unknown"
                """

    spark.sql(sql_query).show()

    +------------------+---------+------------------+
    |               avg|      sum|               std|
    +------------------+---------+------------------+
    |13.368011068572079|5642584.0|16.777518213632323|
    +------------------+---------+------------------+
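
`get_Page` returns 0 when a comment carries no page count, so those rows pull the average down. As a sketch of one alternative, the plain-Python version below returns `None` for such comments and computes the mean over the remaining values only. `extract_pages` is a hypothetical helper, its pattern also accepts the singular "page" (a deliberate difference from the notebook), and the sample comments are a few of the untruncated values displayed earlier.

```python
import re
from statistics import mean

# One or more digits followed by "page" or "pages".
PAGES_RE = re.compile(r'(\d+) pages?\b')

def extract_pages(comment):
    """Return the first page count found in a comment, or None if absent."""
    match = PAGES_RE.search(comment)
    return int(match.group(1)) if match else None

comments = ["23 pages, 3 figures", "11 pages", "Minor corrections",
            "36 pages, 17 figures", "18 pages, 1 figure"]

counts = [extract_pages(c) for c in comments]
known = [n for n in counts if n is not None]

mean_known = mean(known)                                       # missing counts ignored
mean_zero_filled = mean(n if n is not None else 0 for n in counts)  # missing counted as 0
print(counts, mean_known, mean_zero_filled)
```

In Spark, a Python UDF that returns `None` yields SQL `NULL`, which aggregates such as `AVG` skip, so a similar change to the registered UDF would alter the figures shown above.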