# Creating DataFrames

A DataFrame can be created in several ways:

* from an RDD or any Python collection
* from other DataFrames
* from files/databases

We will look at each of these options.

## Initializing a SparkSession

In [1]:
from pyspark.sql import SparkSession

from pyspark.sql.types import *
from pyspark.sql.functions import col

spark = (
    SparkSession
        .builder
        .appName("create-dataframes")
        .master("local[4]")
        .config("spark.dynamicAllocation.enabled", False)
        .config("spark.sql.adaptive.enabled", False)
        .getOrCreate()
)
sc = spark.sparkContext
spark

23/09/09 09:50:02 WARN Utils: Your hostname, pupil-a resolves to a loopback address: 127.0.1.1; using 23.88.105.62 instead (on interface eth0)
23/09/09 09:50:02 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/09 09:50:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
from IPython.display import HTML, display
display(HTML("<style>pre { white-space: pre !important; }</style>"))

## Creating from an RDD

In [9]:
data = [
    [1, "Oliver", 32],
    [2, "Xiaofei", 19],
    [3, "Marc", 31]
]

dozenten_rdd = sc.parallelize(data)

In [10]:
dozenten_df = dozenten_rdd.toDF()

In [11]:
type(dozenten_rdd)

pyspark.rdd.RDD

In [12]:
dozenten_df.show()

+---+-------+---+
| _1|     _2| _3|
+---+-------+---+
|  1| Oliver| 32|
|  2|Xiaofei| 19|
|  3|   Marc| 31|
+---+-------+---+



In [13]:
dozenten_df = dozenten_df.toDF("ID", "Name", "Age")

In [14]:
dozenten_df.show()

+---+-------+---+
| ID|   Name|Age|
+---+-------+---+
|  1| Oliver| 32|
|  2|Xiaofei| 19|
|  3|   Marc| 31|
+---+-------+---+



In [15]:
dozenten_df.printSchema()

root
 |-- ID: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)



## Creating a DataFrame from a Collection

In [18]:
dozenten_df = (
    spark.createDataFrame(
        data,
        "Id: long, Name: string, Age: int"
    )
)

In [19]:
dozenten_df.show()

+---+-------+---+
| Id|   Name|Age|
+---+-------+---+
|  1| Oliver| 32|
|  2|Xiaofei| 19|
|  3|   Marc| 31|
+---+-------+---+



In [20]:
dozenten_df.printSchema()

root
 |-- Id: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)



## Creating from a File

### CSV

In [23]:
yellow_taxis_df = spark.read.csv("YellowTaxis_202210.csv.gz")

In [24]:
yellow_taxis_df.show(4)

+--------+--------------------+--------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+--------------------+------------+--------------------+-----------+
|     _c0|                 _c1|                 _c2|            _c3|          _c4|       _c5|               _c6|         _c7|         _c8|         _c9|       _c10| _c11|   _c12|      _c13|        _c14|                _c15|        _c16|                _c17|       _c18|
+--------+--------------------+--------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+--------------------+------------+--------------------+-----------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_date...|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls

In [32]:
yellow_taxis_df = spark.read.option("header", True).csv("YellowTaxis_202210.csv.gz")

In [33]:
yellow_taxis_df.show(4)

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|airport_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|       1|2022-10-01T05:33:...| 2022-10-01T05:48:...|            1.0|          1.7|       1.0|                 N|         249|         107|           1|        9.5|  3.0|    0.5|      2.6

In [34]:
yellow_taxis_df.count()

                                                                                

3675412

In [21]:
yellow_taxis_df.show(2)

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|airport_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|       1|2022-10-01T05:33:...| 2022-10-01T05:48:...|            1.0|          1.7|       1.0|                 N|         249|         107|           1|        9.5|  3.0|    0.5|      2.6

In [35]:
yellow_taxis_df.printSchema()

root
 |-- VendorID: string (nullable = true)
 |-- tpep_pickup_datetime: string (nullable = true)
 |-- tpep_dropoff_datetime: string (nullable = true)
 |-- passenger_count: string (nullable = true)
 |-- trip_distance: string (nullable = true)
 |-- RatecodeID: string (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: string (nullable = true)
 |-- DOLocationID: string (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- fare_amount: string (nullable = true)
 |-- extra: string (nullable = true)
 |-- mta_tax: string (nullable = true)
 |-- tip_amount: string (nullable = true)
 |-- tolls_amount: string (nullable = true)
 |-- improvement_surcharge: string (nullable = true)
 |-- total_amount: string (nullable = true)
 |-- congestion_surcharge: string (nullable = true)
 |-- airport_fee: string (nullable = true)



### TSV

In [28]:
green_taxi_df = (
    spark.read
    .option("header", "true")
    .option("delimiter", "\t")
    .csv("GreenTaxis_*.csv") # note the wildcard
)

In [29]:
green_taxi_df.show(4)

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|VendorId|lpep_pickup_datetime|lpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|airport_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|       2|2022-10-01T06:08:...| 2022-10-01T06:21:...|            1.0|         2.47|       1.0|                 N|         256|         225|         1.0|       11.5|  0.5|    0.5|      2.5

One column references a payment type; we load the matching lookup table from a JSON file.

### JSON

In [42]:
%less PaymentTypes.json

{"PaymentTypeID":1,"PaymentType":"Credit Card"}
{"PaymentTypeID":2,"PaymentType":"Cash"}
{"PaymentTypeID":3,"PaymentType":"No Charge"}
{"PaymentTypeID":4,"PaymentType":"Dispute"}
{"PaymentTypeID":5,"PaymentType":"Unknown"}
{"PaymentTypeID":6,"PaymentType":"Voided Trip"}


In [32]:
payment_types_df = (
    spark.read.json("PaymentTypes.json") # there are also spark.read.parquet, spark.read.jdbc, ...
)

In [31]:
payment_types_df.show()

+-----------+-------------+
|PaymentType|PaymentTypeID|
+-----------+-------------+
|Credit Card|            1|
|       Cash|            2|
|  No Charge|            3|
|    Dispute|            4|
|    Unknown|            5|
|Voided Trip|            6|
+-----------+-------------+

