### <mark> df.show(n=20, truncate=20, vertical=False)

The PySpark DataFrame show() method displays the contents of a DataFrame as a table of rows and columns. By default it shows only the first 20 rows and truncates column values longer than 20 characters.

Use show() to display the contents of the DataFrame and printSchema() to print its schema.

    # Default - displays 20 rows and truncates column values at 20 characters
    df.show()

    # Display full column contents
    df.show(truncate=False)

    # Display 2 rows and full column contents
    df.show(2, truncate=False)

    # Display 2 rows and truncate column values at 25 characters
    df.show(2, truncate=25)

    # Display DataFrame rows & columns vertically
    df.show(n=3, truncate=25, vertical=True)
    
    

### <mark> df.printSchema()

    root
     |-- firstname: string (nullable = true)
     |-- middlename: string (nullable = true)
     |-- lastname: string (nullable = true)

### <mark> from pyspark.sql.types import StructType, StructField, StringType
    
PySpark can infer a schema from the data, but sometimes we need to define our own column names and data types. The StructType and StructField classes are used to specify the schema of a DataFrame programmatically and to create complex columns such as nested struct, array, and map columns. A StructType is a collection of StructFields, each of which defines a column name, a column data type, a boolean that says whether the field is nullable, and optional metadata.

    from pyspark.sql.types import (
        StructType,
        StructField,
        StringType,
        IntegerType,
        ArrayType,
        MapType,
    )

    schema = StructType([
        StructField("firstname", StringType(), True),
        StructField("middlename", StringType(), True),
    ])

#### ArrayType and MapType
    
    arrayStructureSchema = StructType([
        StructField('name', StructType([
           StructField('firstname', StringType(), True),
           StructField('middlename', StringType(), True),
           StructField('lastname', StringType(), True)
           ])),
           StructField('hobbies', ArrayType(StringType()), True),
           StructField('properties', MapType(StringType(),StringType()), True)
        ])


    root
     |-- name: struct (nullable = true)
     |    |-- firstname: string (nullable = true)
     |    |-- middlename: string (nullable = true)
     |    |-- lastname: string (nullable = true)
     |-- hobbies: array (nullable = true)
     |    |-- element: string (containsNull = true)
     |-- properties: map (nullable = true)
     |    |-- key: string
     |    |-- value: string (valueContainsNull = true)

#### nestedSchema
    
    nestedSchema = StructType([
        StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),
         StructField('salary', IntegerType(), True)
         ])

#### adding & changing struct of the DataFrame

    from pyspark.sql.functions import col, struct, when
    from pyspark.sql.types import IntegerType

    # Pack id, gender, salary and a derived Salary_Grade into one struct column,
    # then drop the original top-level columns.
    updatedDF = df2.withColumn("OtherInfo",
        struct(col("id").alias("identifier"),
            col("gender").alias("gender"),
            col("salary").alias("salary"),
            when(col("salary").cast(IntegerType()) < 2000, "Low")
              .when(col("salary").cast(IntegerType()) < 4000, "Medium")
              .otherwise("High").alias("Salary_Grade")
        )).drop("id", "gender", "salary")

#### <mark> df.schema

#### Creating StructType object struct from JSON file

If you have many columns and the structure of the DataFrame changes now and then, it is good practice to load the StructType schema from a JSON file. You can get the schema as a JSON string with df2.schema.json(), store it in a file, and later recreate the schema from that file.

    > print(df2.schema.json())

    > df.schema.simpleString() 
    # this will return relatively simple schema format

    > import json
    > schemaFromJson = StructType.fromJson(json.loads(df2.schema.json()))
    > name_rdd = spark.sparkContext.parallelize(name_data)
    > df3 = spark.createDataFrame(name_rdd,schemaFromJson)

### Checking if a Column Exists in a DataFrame


    > "firstname" in df.schema.fieldNames()

    > StructField("firstname", StringType(), True) in df.schema.fields
