# Analysieren und Bereinigen von Daten

Was kann alles schief laufen?

* Komplettheit
  Wir können entsprechende Einträge entfernen oder die Werte angemessen auffüllen

* Einzigartigkeit
  Entfernen von Duplikaten

* Aktualität
  Zeiträume eingrenzen

* Akuratheit
  Entfernen von korrupten Daten
  

## Analysieren der Daten

In [4]:
import findspark
findspark.init()
findspark.find()


from pyspark.sql import SparkSession
from pyspark.sql.functions import * 
from pyspark.sql.types import *

import schemata

spark = (
    SparkSession
    .builder
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.sql.adaptive.enabled", "false")
    .appName("analyse")
    .master("local[4]")
    .getOrCreate()
)
sc = spark.sparkContext

from IPython.display import *
display(HTML("<style>pre { white-space: pre !important; }</style>"))

spark

23/09/02 11:07:07 WARN Utils: Your hostname, pupil-a resolves to a loopback address: 127.0.1.1; using 167.235.141.210 instead (on interface eth0)
23/09/02 11:07:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/02 11:07:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [5]:
yellow_taxi_df = spark.read.option("header", True).schema(schemata.yellow_taxi_schema).csv("YellowTaxis_202210.csv.gz")
yellow_taxi_df.printSchema()

root
 |-- VendorId: integer (nullable = true)
 |-- lpep_pickup_datetime: timestamp (nullable = true)
 |-- lpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: double (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: double (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)



In [6]:
yellow_taxi_analyzed_df = yellow_taxi_df.describe("passenger_count", "trip_distance")
yellow_taxi_analyzed_df.show()

[Stage 0:>                                                          (0 + 1) / 1]

+-------+------------------+-----------------+
|summary|   passenger_count|    trip_distance|
+-------+------------------+-----------------+
|  count|           3542392|          3675412|
|   mean|1.3846934500755421|6.206976298167039|
| stddev|0.9302303297406955|640.8236808320255|
|    min|               0.0|              0.0|
|    max|               9.0|        389678.46|
+-------+------------------+-----------------+



                                                                                

Wir können hier schon ein paar wahrscheinliche Datenfehler erkennen.
0 Passagiere sind wohl ein Fehler, 9 Gäste ist verboten und wohl auch ein Fehler



### Weg filtern nicht akkurater Daten

In [7]:
yellow_taxi_df.rdd.repartition(4)
print("Before: " + str(yellow_taxi_df.count()))
yellow_taxi_df = yellow_taxi_df.where("passenger_count > 0").filter(col("trip_distance") > 0.0)
print("After: " + str(yellow_taxi_df.count()))

                                                                                

Before: 3675412


[Stage 4:>                                                          (0 + 1) / 1]

After: 3422296


                                                                                

In [8]:
yellow_taxi_df.rdd.getNumPartitions()

1

## Nun Zeilen mit Null filtern


In [9]:
print("Vor dem Filtern" + str(yellow_taxi_df.count()))

yellow_taxi_df = yellow_taxi_df.na.drop("all")

print("After operation = " + str(yellow_taxi_df.count()))


                                                                                

Vor dem Filtern3422296


23/09/02 11:08:44 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge, airport_fee
 Schema: VendorId, lpep_pickup_datetime, lpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge, airport_fee
Expected: lpep_pickup_datetime but found: tpep_pickup_datetime
CSV file: file:///home/pupil/spark-course/course/02-Dataframes/YellowTaxis_202210.csv.gz
[Stage 8:>                                                          (0 + 1) / 1]

After operation = 3422296


                                                                                

## Nun Nullwerte mit anderen Werten auffüllen


In [11]:
default_value_map = {'payment_type': 5, 'RateCodeId': 1}

yellow_taxi_df = yellow_taxi_df.na.fill(default_value_map)

## Duplikate entfernen


In [12]:
print("Vor der Operation: " + str(yellow_taxi_df.count()))


# you can also specify column-names for identifying duplicates
yellow_taxi_without_dup_df = yellow_taxi_df.drop_duplicates()

print("Nach der Operation: " + str(yellow_taxi_without_dup_df.count()))

23/09/02 11:09:21 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge, airport_fee
 Schema: VendorId, lpep_pickup_datetime, lpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge, airport_fee
Expected: lpep_pickup_datetime but found: tpep_pickup_datetime
CSV file: file:///home/pupil/spark-course/course/02-Dataframes/YellowTaxis_202210.csv.gz
                                                                                

Vor der Operation: 3422296


23/09/02 11:09:57 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge, airport_fee
 Schema: VendorId, lpep_pickup_datetime, lpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge, airport_fee
Expected: lpep_pickup_datetime but found: tpep_pickup_datetime
CSV file: file:///home/pupil/spark-course/course/02-Dataframes/YellowTaxis_202210.csv.gz

Nach der Operation: 3422295


                                                                                

## Beschränke Zeiträume

In [13]:
yellow_taxi_df.printSchema()

root
 |-- VendorId: integer (nullable = true)
 |-- lpep_pickup_datetime: timestamp (nullable = true)
 |-- lpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: double (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: double (nullable = false)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- payment_type: integer (nullable = false)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)



In [14]:
print("Vor dem Filtern : " + str(yellow_taxi_df.count()))

yellow_taxi_df = yellow_taxi_df.where("tpep_pickup_datetime >= '2022-10-01' AND tpep_dropoff_datetime < '2022-11-01'")

print ("Nach dem Filtern : " + str(yellow_taxi_df.count()))

                                                                                

Vor dem Filtern : 3421415




Nach dem Filtern : 3421415


                                                                                

## Alles zusammen machen


In [15]:
default_value_map = {'payment_type': 5, 'RateCodeId': 1}

yellow_taxi_df = spark.read.option("header", True).schema(schemata.yellow_taxi_schema).csv("YellowTaxis_202210.csv.gz")

print("Vor dem Filtern: " + str(yellow_taxi_df.count()))

yellow_taxi_df = (
    yellow_taxi_df
    .where("passenger_count > 0")
    .filter(col("trip_distance") > 0.0)
    .na.drop("all")
    .na.fill(default_value_map)
    .drop_duplicates()
    .where("tpep_pickup_datetime >= '2022-10-01' AND tpep_dropoff_datetime < '2022-11-01'")
)

print("Nach dem Filtern: + " + str(yellow_taxi_df.count()))


                                                                                

Vor dem Filtern: 3675412




Nach dem Filtern: + 3421415


                                                                                

# Transformieren

## Spalten limitieren


In [16]:
yellow_taxi_df.printSchema()

root
 |-- VendorID: integer (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: double (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: double (nullable = false)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- payment_type: integer (nullable = false)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)



In [18]:
yellow_taxi_df = (
    yellow_taxi_df.select(
        "VendorID",
        col("passenger_count").cast(IntegerType()),
        column("trip_distance").alias("TripDistance"),
        yellow_taxi_df.tpep_pickup_datetime,
        "tpep_dropoff_datetime",
        "PULocationID",
        "DOLocationID",
        "RatecodeID",
        "total_amount",
        "payment_type"
        )
)
yellow_taxi_df.printSchema()

NameError: name 'IntegerType' is not defined

das schränkt auf die größe des Dataframes ein und macht die Verarbeitung schneller


### Spalten umbenennen

### Abgeleitete Spalten erzeugen

### Ausführpläne


In [41]:
yellow_taxi_df.explain(mode="extended")

== Parsed Logical Plan ==
'Filter (('tpep_pickup_datetime >= 2022-10-01) AND ('tpep_dropoff_datetime < 2022-11-01))
+- Deduplicate [DOLocationID#2058, improvement_surcharge#2065, tpep_dropoff_datetime#2052, PULocationID#2057, trip_distance#2054, tolls_amount#2064, RatecodeID#2151, VendorID#2050, tip_amount#2063, payment_type#2152, fare_amount#2060, passenger_count#2053, store_and_fwd_flag#2056, extra#2061, airport_fee#2068, congestion_surcharge#2067, total_amount#2066, tpep_pickup_datetime#2051, mta_tax#2062]
   +- Project [VendorID#2050, tpep_pickup_datetime#2051, tpep_dropoff_datetime#2052, passenger_count#2053, trip_distance#2054, coalesce(nanvl(RatecodeID#2055, cast(null as double)), cast(1 as double)) AS RatecodeID#2151, store_and_fwd_flag#2056, PULocationID#2057, DOLocationID#2058, coalesce(payment_type#2059, cast(5 as int)) AS payment_type#2152, fare_amount#2060, extra#2061, mta_tax#2062, tip_amount#2063, tolls_amount#2064, improvement_surcharge#2065, total_amount#2066, congesti

## Zugriffe auf nested Json Content

In [42]:
import schemata

foo


### Aggregationen

In [None]:
yel