This notebook shows how to specify a schema for a PySpark `DataFrame`.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('BDA').getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/01/20 05:15:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/01/20 05:15:46 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/01/20 05:15:46 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
23/01/20 05:15:46 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


# `StructType` - defines the structure of the dataframe

A [`StructType`](https://sparkbyexamples.com/pyspark/pyspark-structtype-and-structfield/) is a collection (list) of `StructField` objects.

Use `DataFrame.printSchema()` to print the schema; nested `StructType` columns are shown as `struct`.

## `StructField` - metadata of the dataframe columns

Syntax:
```python
from pyspark.sql.types import *

StructField(name: str, dataType: DataType, nullable: bool = True, metadata: dict = None)
```

[Available data types](https://spark.apache.org/docs/latest/sql-ref-datatypes.html)

Short list:
- StringType()
- IntegerType()
- FloatType()
- DoubleType()
- TimestampType()
- DateType()

etc

In [2]:
from pyspark.sql.types import *

data = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)]

# Line continuations are not needed inside brackets
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

psdf = spark.createDataFrame(data, schema=schema)

Schema:

In [3]:
psdf.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)



Dataframe:

In [4]:
psdf.show()

                                                                                

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|    James|          |   Smith|36636|     M|  3000|
|  Michael|      Rose|        |40288|     M|  4000|
|   Robert|          |Williams|42114|     M|  4000|
|    Maria|      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+



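Alternatively, we can pass the schema as a DDL-formatted string. Note that `salary` is declared as `long` here, so the resulting schema differs from the `IntegerType` version above: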

[stackoverflow](https://stackoverflow.com/a/71279635)

In [5]:
psdf2 = spark.createDataFrame(data, schema="firstname string, middlename string, lastname string, id string, gender string, salary long")
psdf2.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)



## Nested `StructType` structure

In the following example, the data contains a nested record (the name), so we describe it with a `StructField` whose type is itself a `StructType`:

In [6]:

structureData = [
    (("James","","Smith"),"36636","M",3100),
    (("Michael","Rose",""),"40288","M",4300),
    (("Robert","","Williams"),"42114","M",1400),
    (("Maria","Anne","Jones"),"39192","F",5500),
    (("Jen","Mary","Brown"),"","F",-1)
  ]
structureSchema = StructType([
        StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('middlename', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),
         StructField('id', StringType(), True),
         StructField('gender', StringType(), True),
         StructField('salary', IntegerType(), True)
         ])

psdf2 = spark.createDataFrame(data=structureData,schema=structureSchema)
psdf2.printSchema()
psdf2.show(truncate=False)

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+--------------------+-----+------+------+
|name                |id   |gender|salary|
+--------------------+-----+------+------+
|{James, , Smith}    |36636|M     |3100  |
|{Michael, Rose, }   |40288|M     |4300  |
|{Robert, , Williams}|42114|M     |1400  |
|{Maria, Anne, Jones}|39192|F     |5500  |
|{Jen, Mary, Brown}  |     |F     |-1    |
+--------------------+-----+------+------+



## `StructType` in JSON

[example](https://sparkbyexamples.com/pyspark/pyspark-structtype-and-structfield#schema-from-json)

If the schema is large, you can save it as a JSON file and load it later.

In [7]:
schema_json = psdf2.schema.json()
# save it somewhere

import json

schemaFromJson = StructType.fromJson(json.loads(schema_json))
df3 = spark.createDataFrame(
        spark.sparkContext.parallelize(structureData),schemaFromJson)
df3.printSchema()

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)



# Schema inference

In [8]:
lines = spark.sparkContext.textFile("../sample.csv")
type(lines)

pyspark.rdd.RDD

In [18]:
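from pyspark.sql import Row

# Every value stays a string, and the CSV header line is parsed as a data row
peopleRDD = lines.map(lambda e: e.split(",")) \
                 .map(lambda p: Row(age=p[0], name=p[1]))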
peopleDF = spark.createDataFrame(peopleRDD)
peopleDF.first()

Row(age='age', name='name')



## Remove first row from dataframe

[stackoverflow](https://stackoverflow.com/a/61782141)