# Applying Schemas

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = (SparkSession
         .builder
         .appName("schemata")
         .master("local[4]")
         .config("spark.dynamicAllocation.enabled", "false")
         .config("spark.sql.adaptive.enabled", "false")
         .getOrCreate()
        )
sc = spark.sparkContext
spark

23/09/10 08:56:27 WARN Utils: Your hostname, keen-northcutt resolves to a loopback address: 127.0.1.1; using 116.203.107.225 instead (on interface eth0)
23/09/10 08:56:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/10 08:56:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/09/10 08:56:29 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/09/10 08:56:29 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


In [5]:
from IPython.display import *
display(HTML("<style>pre { white-space: pre !important; }</style>"))

## Without a schema

In [6]:
yellow_taxi_df = spark.read.option("header", True).csv("YellowTaxis_202210.csv.gz")
yellow_taxi_df.printSchema()

root
 |-- VendorID: string (nullable = true)
 |-- tpep_pickup_datetime: string (nullable = true)
 |-- tpep_dropoff_datetime: string (nullable = true)
 |-- passenger_count: string (nullable = true)
 |-- trip_distance: string (nullable = true)
 |-- RatecodeID: string (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: string (nullable = true)
 |-- DOLocationID: string (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- fare_amount: string (nullable = true)
 |-- extra: string (nullable = true)
 |-- mta_tax: string (nullable = true)
 |-- tip_amount: string (nullable = true)
 |-- tolls_amount: string (nullable = true)
 |-- improvement_surcharge: string (nullable = true)
 |-- total_amount: string (nullable = true)
 |-- congestion_surcharge: string (nullable = true)
 |-- airport_fee: string (nullable = true)



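* Check in the Spark UI that a job was created.
* But isn't reading supposed to be lazy?
* So why is there a job? With `header` enabled and no schema supplied, Spark has to read the first line of the file to learn the column names.
* Say a few words about the SQL/DataFrame tab in the Spark UI.
* Show that the data types are not correct yet: every column is read as `string`.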
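The reason for the job can be illustrated without Spark. The following is a minimal sketch in plain Python, not Spark's actual code: to know the column names, something has to open the compressed file and read its first line. The helper `read_header` and the two sample rows are made up for this illustration.

```python
import csv
import gzip
import io


def read_header(raw_bytes: bytes) -> list[str]:
    """Return the column names from the first line of a gzipped CSV."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", newline="") as handle:
        # Only the first record is consumed; the rest of the data stays unread.
        return next(csv.reader(handle))


# Tiny in-memory stand-in for a file such as YellowTaxis_202210.csv.gz (made-up rows).
sample = gzip.compress(b"VendorID,trip_distance\n1,2.5\n2,0.8\n")
columns = read_header(sample)
print(columns)
```

The names are known after this step, but nothing has been learned about the types, which matches the all-`string` schema printed above.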


## Inferred schema

In [7]:
yellow_taxi_df = spark.read.option("header", True).option("inferSchema", True).csv("YellowTaxis_202210.csv.gz")
yellow_taxi_df.printSchema()


root
 |-- VendorID: integer (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: double (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: double (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)



                                                                                

### Pros and cons of automatic schema inference

#### Pros

* Useful during development for getting familiar with the data
* No effort needed to write schemas

#### Cons

* Errors in the data lead to a wrong schema
* It is slow, because Spark needs an extra pass over the data to infer the types

#### Therefore

For production Spark applications, it is better to specify schemas manually.
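To see why errors in the data can lead to a wrong schema, consider the following toy inference routine. It is a deliberately simplified sketch and not Spark's algorithm; the function `infer_type` and the sample values are assumptions made for this illustration. It picks the narrowest type that fits every value in a column, so a single bad value changes the result for the whole column.

```python
def infer_type(values):
    """Toy inference: narrowest of integer, double, string that fits every value."""

    def fits(cast):
        # A type fits only if every single value can be converted.
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if fits(int):
        return "integer"
    if fits(float):
        return "double"
    return "string"


clean = ["1", "2", "4"]
dirty = ["1", "2", "N/A"]  # one malformed entry

print(infer_type(clean))
print(infer_type(dirty))
```

Note that every value has to be inspected before the type is known, which is also why inference costs an extra pass over the data.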


## Manual schemas

In [8]:
%less schemata.py

from pyspark.sql.types import *

yellow_taxi_schema  =  (  StructType
                        ([
                            StructField("VendorId"               , IntegerType()   , True),
                            StructField("tpep_pickup_datetime"   , TimestampType() , True),
                            StructField("tpep_dropoff_datetime"  , TimestampType() , True),

In [9]:
import schemata
yellow_taxi_df = (
    spark
        .read
        .option("header", True)
        .schema(schemata.yellow_taxi_schema)
        .csv("YellowTaxis_202210.csv.gz")
)

In [10]:
yellow_taxi_df.printSchema()

root
 |-- VendorId: integer (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: double (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: double (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)



Verify in the Spark UI that no job was actually started for this.

### Now with a JSON file

In [11]:
taxi_bases_df = spark.read.json("TaxiBases.json")

In [12]:
taxi_bases_df.show()

AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().

In [None]:
taxi_bases_df = spark.read.option("multiline", True).json("TaxiBases.json")
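By default, Spark's JSON reader expects one complete JSON document per line (JSON Lines). The error above suggests that `TaxiBases.json` is instead formatted across several lines, so every line failed to parse on its own. The following plain-Python sketch illustrates the difference; the field `id` and both sample strings are made up, and the real file's layout is an assumption.

```python
import json

# One JSON object per line: what a line-based reader expects.
json_lines = '{"id": 1}\n{"id": 2}\n'

# The same records as one pretty-printed document spanning several lines.
multi_line = '[\n  {\n    "id": 1\n  },\n  {\n    "id": 2\n  }\n]\n'


def parse_line_by_line(text):
    """Parse each non-empty line as its own JSON document; None marks a failure."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            records.append(None)
    return records


print(parse_line_by_line(json_lines))   # every line parses
print(parse_line_by_line(multi_line))   # no single line is valid JSON
print(json.loads(multi_line))           # parsing the whole text at once works
```

Reading the whole text at once is roughly what the `multiline` option asks Spark to do.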

In [None]:
taxi_bases_df.show(3, truncate=False)

In [None]:
taxi_bases_df.printSchema()

In [None]:
taxi_bases_df = spark.read.option("multiline", True).schema(schemata.taxi_bases_schema).json("TaxiBases.json")
taxi_bases_df.printSchema()
taxi_bases_df.show(truncate=False)

What do you take away from this lesson?

You can have Spark infer a schema automatically, but this can take a long time and can produce incorrect results. In practice it is therefore advisable to define schemas manually and pass them to the reader explicitly.
