# Introduction to the DataFrame API

In this section, we will introduce the [DataFrame and Dataset APIs](https://spark.apache.org/docs/latest/sql-programming-guide.html).

We will use a small subset of the [Record Linkage Comparison Data Set](https://archive.ics.uci.edu/ml/datasets/record+linkage+comparison+patterns) from the UC Irvine Machine Learning Repository. It consists of several CSV files with match scores for patients in a German hospital, but we will use only one of them for the sake of simplicity. Please consult {cite:p}`schmidtmann2009evaluation` and {cite:p}`sariyar2011controlling` for more details regarding the data set and the research.

## Setup
- Set up a `SparkSession` to work with the Dataset and DataFrame API.
- Unzip the `scores.zip` file located under the `data` folder.

In [1]:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("intro-to-df").setMaster("local")
sc = SparkContext(conf=conf)
# Avoid polluting the console with warning messages
sc.setLogLevel("ERROR")



Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/07 17:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Create a SparkSession to work with the DataFrame API

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession(sc)

In [3]:
help(SparkSession)

Help on class SparkSession in module pyspark.sql.session:

class SparkSession(pyspark.sql.pandas.conversion.SparkConversionMixin)
 |  SparkSession(sparkContext, jsparkSession=None)
 |  
 |  The entry point to programming Spark with the Dataset and DataFrame API.
 |  
 |  A SparkSession can be used create :class:`DataFrame`, register :class:`DataFrame` as
 |  tables, execute SQL over tables, cache tables, and read parquet files.
 |  To create a :class:`SparkSession`, use the following builder pattern:
 |  
 |  .. autoattribute:: builder
 |     :annotation:
 |  
 |  Examples
 |  --------
 |  >>> spark = SparkSession.builder \
 |  ...     .master("local") \
 |  ...     .appName("Word Count") \
 |  ...     .config("spark.some.config.option", "some-value") \
 |  ...     .getOrCreate()
 |  
 |  >>> from datetime import datetime
 |  >>> from pyspark.sql import Row
 |  >>> spark = SparkSession(sc)
 |  >>> allTypes = sc.parallelize([Row(i=1, s="string", d=1.0, l=1,
 |  ...     b=True, list=[1, 

### Unzip the scores file, if it was not done already

In [4]:
from os import path
scores_zip = path.join("data", "scores.zip")
scores_csv = path.join("data", "scores.csv")

%set_env SCORES_ZIP=$scores_zip
%set_env SCORES_CSV=$scores_csv

env: SCORES_ZIP=data/scores.zip
env: SCORES_CSV=data/scores.csv


In [5]:
%%bash
command -v unzip >/dev/null 2>&1 || { echo >&2 "unzip command is not installed. Aborting."; exit 1; }
[[ -f "$SCORES_CSV" ]] && { echo "file $SCORES_CSV already exists. Skipping."; exit 0; }

[[ -f "$SCORES_ZIP" ]] || { echo "file $SCORES_ZIP does not exist. Aborting."; exit 1; }

echo "Unzip file $SCORES_ZIP"
unzip "$SCORES_ZIP" -d data

Unzip file data/scores.zip


Archive:  data/scores.zip


  inflating: data/scores.csv         


  inflating: data/__MACOSX/._scores.csv  


In [6]:
! head "$SCORES_CSV"

"id_1","id_2","cmp_fname_c1","cmp_fname_c2","cmp_lname_c1","cmp_lname_c2","cmp_sex","cmp_bd","cmp_bm","cmp_by","cmp_plz","is_match"
37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE
39086,47614,1,?,1,?,1,1,1,1,1,TRUE
70031,70237,1,?,1,?,1,1,1,1,1,TRUE
84795,97439,1,?,1,?,1,1,1,1,1,TRUE
36950,42116,1,?,1,1,1,1,1,1,1,TRUE
42413,48491,1,?,1,?,1,1,1,1,1,TRUE
25965,64753,1,?,1,?,1,1,1,1,1,TRUE
49451,90407,1,?,1,?,1,1,1,1,0,TRUE
39932,40902,1,?,1,?,1,1,1,1,1,TRUE


## Loading the Scores CSV file into a DataFrame

We are going to use the `DataFrameReader` API, which is available through `spark.read`.

In [7]:
help(spark.read)

Help on DataFrameReader in module pyspark.sql.readwriter object:

class DataFrameReader(OptionUtils)
 |  DataFrameReader(spark)
 |  
 |  Interface used to load a :class:`DataFrame` from external storage systems
 |  (e.g. file systems, key-value stores, etc). Use :attr:`SparkSession.read`
 |  to access this.
 |  
 |  .. versionadded:: 1.4
 |  
 |  Method resolution order:
 |      DataFrameReader
 |      OptionUtils
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, spark)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None, comment=None, header=None, inferSchema=None, ignoreLeadingWhiteSpace=None, ignoreTrailingWhiteSpace=None, nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None, maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None, columnNameOfCorruptRecord=None, multiLi

In [8]:
help(spark.read.csv)

Help on method csv in module pyspark.sql.readwriter:

csv(path, schema=None, sep=None, encoding=None, quote=None, escape=None, comment=None, header=None, inferSchema=None, ignoreLeadingWhiteSpace=None, ignoreTrailingWhiteSpace=None, nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None, maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None, columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None, samplingRatio=None, enforceSchema=None, emptyValue=None, locale=None, lineSep=None, pathGlobFilter=None, recursiveFileLookup=None, modifiedBefore=None, modifiedAfter=None, unescapedQuoteHandling=None) method of pyspark.sql.readwriter.DataFrameReader instance
    Loads a CSV file and returns the result as a  :class:`DataFrame`.
    
    This function will go through the input once to determine the input schema if
    ``inferSchema`` is enabled. To avoid going through the entire data once, di

In [9]:
scores = spark.read.csv(scores_csv)


In [10]:
scores

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string, _c6: string, _c7: string, _c8: string, _c9: string, _c10: string, _c11: string]

In [11]:
help(scores.show)

Help on method show in module pyspark.sql.dataframe:

show(n=20, truncate=True, vertical=False) method of pyspark.sql.dataframe.DataFrame instance
    Prints the first ``n`` rows to the console.
    
    .. versionadded:: 1.3.0
    
    Parameters
    ----------
    n : int, optional
        Number of rows to show.
    truncate : bool or int, optional
        If set to ``True``, truncate strings longer than 20 chars by default.
        If set to a number greater than one, truncates long strings to length ``truncate``
        and align cells right.
    vertical : bool, optional
        If set to ``True``, print output rows vertically (one line
        per column value).
    
    Examples
    --------
    >>> df
    DataFrame[age: int, name: string]
    >>> df.show()
    +---+-----+
    |age| name|
    +---+-----+
    |  2|Alice|
    |  5|  Bob|
    +---+-----+
    >>> df.show(truncate=3)
    +---+----+
    |age|name|
    +---+----+
    |  2| Ali|
    |  5| Bob|
    +---+----+
    >>> df

We can look at the first rows of the DataFrame by calling the `show` method.

scores.show()

**Can anyone spot what's wrong with the above data?**

- Question marks
- Column names
- `Float` and `Int` in the same column

Let's check the schema of our DataFrame

In [12]:
help(scores.printSchema)

Help on method printSchema in module pyspark.sql.dataframe:

printSchema() method of pyspark.sql.dataframe.DataFrame instance
    Prints out the schema in the tree format.
    
    .. versionadded:: 1.3.0
    
    Examples
    --------
    >>> df.printSchema()
    root
     |-- age: integer (nullable = true)
     |-- name: string (nullable = true)
    <BLANKLINE>



In [13]:
scores.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)



**Why is everything a `String`?**

### Managing Schema and Null Values

In [14]:
scores_df = (
    spark.read
        .option("header", "true")
        .option("nullValue", "?")
        .option("inferSchema", "true")
        .csv(scores_csv)
)


In [15]:
scores_df.printSchema()

root
 |-- id_1: integer (nullable = true)
 |-- id_2: integer (nullable = true)
 |-- cmp_fname_c1: double (nullable = true)
 |-- cmp_fname_c2: double (nullable = true)
 |-- cmp_lname_c1: double (nullable = true)
 |-- cmp_lname_c2: double (nullable = true)
 |-- cmp_sex: integer (nullable = true)
 |-- cmp_bd: integer (nullable = true)
 |-- cmp_bm: integer (nullable = true)
 |-- cmp_by: integer (nullable = true)
 |-- cmp_plz: integer (nullable = true)
 |-- is_match: boolean (nullable = true)



In [16]:
scores_df.show(5)

+-----+-----+-----------------+------------+------------+------------+-------+------+------+------+-------+--------+
| id_1| id_2|     cmp_fname_c1|cmp_fname_c2|cmp_lname_c1|cmp_lname_c2|cmp_sex|cmp_bd|cmp_bm|cmp_by|cmp_plz|is_match|
+-----+-----+-----------------+------------+------------+------------+-------+------+------+------+-------+--------+
|37291|53113|0.833333333333333|        null|         1.0|        null|      1|     1|     1|     1|      0|    true|
|39086|47614|              1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|70031|70237|              1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|84795|97439|              1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|36950|42116|              1.0|        null|         1.0|         1.0|      1|     1|     1|     1|      1|    true|
+-----+-----+-----------------+------------+------------+-------

## Transformations and Actions

Creating a DataFrame does not cause any distributed computation in the cluster. **A DataFrame is a data set representing an intermediate step in a computation**.

To operate on data (in a distributed manner), we have two types of operations: **transformations** and **actions**:

- Transformations: lazily evaluated. They are not computed immediately; instead, they are recorded as a **lineage** for query plan optimization.
- Actions: distributed computation occurs when an action is invoked.

In [17]:
# how many?
scores_df.count()

574913

We can use the `collect` action to return a Python `list` with all the `Row` objects in our DataFrame.

In [18]:
scores_df.collect()


[Row(id_1=37291, id_2=53113, cmp_fname_c1=0.833333333333333, cmp_fname_c2=None, cmp_lname_c1=1.0, cmp_lname_c2=None, cmp_sex=1, cmp_bd=1, cmp_bm=1, cmp_by=1, cmp_plz=0, is_match=True),
 Row(id_1=39086, id_2=47614, cmp_fname_c1=1.0, cmp_fname_c2=None, cmp_lname_c1=1.0, cmp_lname_c2=None, cmp_sex=1, cmp_bd=1, cmp_bm=1, cmp_by=1, cmp_plz=1, is_match=True),
 Row(id_1=70031, id_2=70237, cmp_fname_c1=1.0, cmp_fname_c2=None, cmp_lname_c1=1.0, cmp_lname_c2=None, cmp_sex=1, cmp_bd=1, cmp_bm=1, cmp_by=1, cmp_plz=1, is_match=True),
 Row(id_1=84795, id_2=97439, cmp_fname_c1=1.0, cmp_fname_c2=None, cmp_lname_c1=1.0, cmp_lname_c2=None, cmp_sex=1, cmp_bd=1, cmp_bm=1, cmp_by=1, cmp_plz=1, is_match=True),
 Row(id_1=36950, id_2=42116, cmp_fname_c1=1.0, cmp_fname_c2=None, cmp_lname_c1=1.0, cmp_lname_c2=1.0, cmp_sex=1, cmp_bd=1, cmp_bm=1, cmp_by=1, cmp_plz=1, is_match=True),
 Row(id_1=42413, id_2=48491, cmp_fname_c1=1.0, cmp_fname_c2=None, cmp_lname_c1=1.0, cmp_lname_c2=None, cmp_sex=1, cmp_bd=1, cmp_bm=1

**The returned `list` will reside in the local memory of the driver!**

## Write to Disk

We are going to save the DataFrame in a different format: Parquet.

In [19]:
scores_df.write.format("parquet").save("data/scores-parquet")


In [20]:
! ls data/scores-parquet

_SUCCESS  part-00000-72cfc7af-ce20-4fad-b762-4940f156cafe-c000.snappy.parquet


In [21]:
scores_parquet = spark.read.parquet("data/scores-parquet")

In [22]:
scores_parquet.printSchema()

root
 |-- id_1: integer (nullable = true)
 |-- id_2: integer (nullable = true)
 |-- cmp_fname_c1: double (nullable = true)
 |-- cmp_fname_c2: double (nullable = true)
 |-- cmp_lname_c1: double (nullable = true)
 |-- cmp_lname_c2: double (nullable = true)
 |-- cmp_sex: integer (nullable = true)
 |-- cmp_bd: integer (nullable = true)
 |-- cmp_bm: integer (nullable = true)
 |-- cmp_by: integer (nullable = true)
 |-- cmp_plz: integer (nullable = true)
 |-- is_match: boolean (nullable = true)



In [23]:
scores_parquet.show(5)

+-----+-----+-----------------+------------+------------+------------+-------+------+------+------+-------+--------+
| id_1| id_2|     cmp_fname_c1|cmp_fname_c2|cmp_lname_c1|cmp_lname_c2|cmp_sex|cmp_bd|cmp_bm|cmp_by|cmp_plz|is_match|
+-----+-----+-----------------+------------+------------+------------+-------+------+------+------+-------+--------+
|37291|53113|0.833333333333333|        null|         1.0|        null|      1|     1|     1|     1|      0|    true|
|39086|47614|              1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|70031|70237|              1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|84795|97439|              1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|36950|42116|              1.0|        null|         1.0|         1.0|      1|     1|     1|     1|      1|    true|
+-----+-----+-----------------+------------+------------+-------

## Analyzing Data

All good so far, but we don't load data for the sake of it; we do it because we want to run some analysis.

- The first two columns are integer IDs. They represent the patients that were matched in the record.
- The next nine columns are numeric values (int and double). They represent match scores on different fields, such as name, sex, birthday, and location.
- The last column is a boolean value indicating whether or not the pair of patient records represented by the line was a match.

**We could use this dataset to build a simple classifier that allows us to predict whether a record will be a match based on the values of the match scores for patient records.**

### Caching

Each time we run an action (e.g., calling the `collect` method), Spark re-opens the file, parses the rows, and then executes the requested action. This happens even if we have already filtered the data down to a smaller set of records.

We can use the `cache` method to ask Spark to keep the DataFrame in memory (spilling to disk if needed) once it has been computed.

In [24]:
help(scores_df.cache)

Help on method cache in module pyspark.sql.dataframe:

cache() method of pyspark.sql.dataframe.DataFrame instance
    Persists the :class:`DataFrame` with the default storage level (`MEMORY_AND_DISK`).
    
    .. versionadded:: 1.3.0
    
    Notes
    -----
    The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.



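**Spark is in-memory only. Myth or misconception?**

With `MEMORY_AND_DISK`, partitions that do not fit in memory are "spilled" to disk. Some of the available storage levels:
- `MEMORY_AND_DISK`
- `MEMORY_ONLY`
- `MEMORY_ONLY_SER` (Scala and Java only)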

In [25]:
scores_cached = scores_df.cache()

In [26]:
scores_cached.count()


574913

In [27]:
scores_cached.take(10)

[Row(id_1=37291, id_2=53113, cmp_fname_c1=0.833333333333333, cmp_fname_c2=None, cmp_lname_c1=1.0, cmp_lname_c2=None, cmp_sex=1, cmp_bd=1, cmp_bm=1, cmp_by=1, cmp_plz=0, is_match=True),
 Row(id_1=39086, id_2=47614, cmp_fname_c1=1.0, cmp_fname_c2=None, cmp_lname_c1=1.0, cmp_lname_c2=None, cmp_sex=1, cmp_bd=1, cmp_bm=1, cmp_by=1, cmp_plz=1, is_match=True),
 Row(id_1=70031, id_2=70237, cmp_fname_c1=1.0, cmp_fname_c2=None, cmp_lname_c1=1.0, cmp_lname_c2=None, cmp_sex=1, cmp_bd=1, cmp_bm=1, cmp_by=1, cmp_plz=1, is_match=True),
 Row(id_1=84795, id_2=97439, cmp_fname_c1=1.0, cmp_fname_c2=None, cmp_lname_c1=1.0, cmp_lname_c2=None, cmp_sex=1, cmp_bd=1, cmp_bm=1, cmp_by=1, cmp_plz=1, is_match=True),
 Row(id_1=36950, id_2=42116, cmp_fname_c1=1.0, cmp_fname_c2=None, cmp_lname_c1=1.0, cmp_lname_c2=1.0, cmp_sex=1, cmp_bd=1, cmp_bm=1, cmp_by=1, cmp_plz=1, is_match=True),
 Row(id_1=42413, id_2=48491, cmp_fname_c1=1.0, cmp_fname_c2=None, cmp_lname_c1=1.0, cmp_lname_c2=None, cmp_sex=1, cmp_bd=1, cmp_bm=1

### Query Plan

In [28]:
scores_cached.explain()

== Physical Plan ==
FileScan csv [id_1#56,id_2#57,cmp_fname_c1#58,cmp_fname_c2#59,cmp_lname_c1#60,cmp_lname_c2#61,cmp_sex#62,cmp_bd#63,cmp_bm#64,cmp_by#65,cmp_plz#66,is_match#67] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/runner/work/bda-course/bda-course/coursebook/modules/m2/dat..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id_1:int,id_2:int,cmp_fname_c1:double,cmp_fname_c2:double,cmp_lname_c1:double,cmp_lname_c2...




### GroupBy + OrderBy

In [29]:
from pyspark.sql.functions import col

scores_cached.groupBy("is_match").count().orderBy(col("count").desc()).show()

+--------+------+
|is_match| count|
+--------+------+
|   false|572820|
|    true|  2093|
+--------+------+
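
The two counts can be turned into a share of matches with plain Python. The numbers below are copied from the output above; the variable names are only illustrative.

```python
# Counts copied from the groupBy output above.
counts = {False: 572820, True: 2093}

total = sum(counts.values())
match_share = counts[True] / total

print(f"total={total}, match share={match_share:.4%}")
```

Matches make up well under 1% of the rows, which is worth keeping in mind when thinking about a classifier for this data.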



## Aggregation Functions

In addition to `count`, we can also compute more complex aggregations such as sums, minimums, maximums, means, and standard deviations. How? We use the `agg` method of the DataFrame API.

In [30]:
from pyspark.sql.functions import avg, stddev

In [31]:
aggregated = scores_cached.agg(avg("cmp_sex"), stddev("cmp_sex"))

In [32]:
aggregated.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[avg(cmp_sex#62), stddev_samp(cast(cmp_sex#62 as double))])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#220]
      +- HashAggregate(keys=[], functions=[partial_avg(cmp_sex#62), partial_stddev_samp(cast(cmp_sex#62 as double))])
         +- InMemoryTableScan [cmp_sex#62]
               +- InMemoryRelation [id_1#56, id_2#57, cmp_fname_c1#58, cmp_fname_c2#59, cmp_lname_c1#60, cmp_lname_c2#61, cmp_sex#62, cmp_bd#63, cmp_bm#64, cmp_by#65, cmp_plz#66, is_match#67], StorageLevel(disk, memory, deserialized, 1 replicas)
                     +- FileScan csv [id_1#56,id_2#57,cmp_fname_c1#58,cmp_fname_c2#59,cmp_lname_c1#60,cmp_lname_c2#61,cmp_sex#62,cmp_bd#63,cmp_bm#64,cmp_by#65,cmp_plz#66,is_match#67] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/runner/work/bda-course/bda-course/coursebook/modules/m2/dat..., PartitionFilters: [], PushedFilters: 

In [33]:
aggregated.show()

+------------------+--------------------+
|      avg(cmp_sex)|stddev_samp(cmp_sex)|
+------------------+--------------------+
|0.9550923357099248| 0.20710152240504734|
+------------------+--------------------+



## SQL

Spark SQL lets us query a DataFrame with SQL, using either an ANSI 2003-compliant version of SQL or HiveQL.

In [34]:
scores_df.createOrReplaceTempView("scores")

In [35]:
# scores_cached.groupBy("is_match").count().orderBy(col("count").desc()).show()
spark.sql("""
    SELECT is_match, COUNT(*) cnt
    FROM scores
    GROUP BY is_match
    ORDER BY cnt DESC
""").show()

+--------+------+
|is_match|   cnt|
+--------+------+
|   false|572820|
|    true|  2093|
+--------+------+



In [36]:
spark.sql("""
    SELECT is_match, COUNT(*) cnt
    FROM scores
    GROUP BY is_match
    ORDER BY cnt DESC
""").explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [cnt#2165L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(cnt#2165L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#330]
      +- HashAggregate(keys=[is_match#67], functions=[count(1)])
         +- Exchange hashpartitioning(is_match#67, 200), ENSURE_REQUIREMENTS, [id=#327]
            +- HashAggregate(keys=[is_match#67], functions=[partial_count(1)])
               +- InMemoryTableScan [is_match#67]
                     +- InMemoryRelation [id_1#56, id_2#57, cmp_fname_c1#58, cmp_fname_c2#59, cmp_lname_c1#60, cmp_lname_c2#61, cmp_sex#62, cmp_bd#63, cmp_bm#64, cmp_by#65, cmp_plz#66, is_match#67], StorageLevel(disk, memory, deserialized, 1 replicas)
                           +- FileScan csv [id_1#56,id_2#57,cmp_fname_c1#58,cmp_fname_c2#59,cmp_lname_c1#60,cmp_lname_c2#61,cmp_sex#62,cmp_bd#63,cmp_bm#64,cmp_by#65,cmp_plz#66,is_match#67] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 pat

In [37]:
scores_cached.groupBy("is_match").count().orderBy(col("count").desc()).explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#2364L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#2364L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#350]
      +- HashAggregate(keys=[is_match#67], functions=[count(1)])
         +- Exchange hashpartitioning(is_match#67, 200), ENSURE_REQUIREMENTS, [id=#347]
            +- HashAggregate(keys=[is_match#67], functions=[partial_count(1)])
               +- InMemoryTableScan [is_match#67]
                     +- InMemoryRelation [id_1#56, id_2#57, cmp_fname_c1#58, cmp_fname_c2#59, cmp_lname_c1#60, cmp_lname_c2#61, cmp_sex#62, cmp_bd#63, cmp_bm#64, cmp_by#65, cmp_plz#66, is_match#67], StorageLevel(disk, memory, deserialized, 1 replicas)
                           +- FileScan csv [id_1#56,id_2#57,cmp_fname_c1#58,cmp_fname_c2#59,cmp_lname_c1#60,cmp_lname_c2#61,cmp_sex#62,cmp_bd#63,cmp_bm#64,cmp_by#65,cmp_plz#66,is_match#67] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1

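### Should I use Spark SQL or the DataFrame API?

It depends on the query. The two physical plans above have the same shape, so the choice is mostly a matter of which form expresses the query more clearly.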

## Pandas is my friend!

Required packages: `pandas` and `numpy`. Install them with either of:
- `poetry add pandas numpy`
- `pip install pandas numpy`

In [38]:
scores_pandas = scores_df.toPandas()


In [39]:
scores_pandas.head()

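|   | id_1 | id_2 | cmp_fname_c1 | cmp_fname_c2 | cmp_lname_c1 | cmp_lname_c2 | cmp_sex | cmp_bd | cmp_bm | cmp_by | cmp_plz | is_match |
|---|------|------|--------------|--------------|--------------|--------------|---------|--------|--------|--------|---------|----------|
| 0 | 37291 | 53113 | 0.833333 |  | 1.0 |  | 1 | 1.0 | 1.0 | 1.0 | 0.0 | True |
| 1 | 39086 | 47614 | 1.0 |  | 1.0 |  | 1 | 1.0 | 1.0 | 1.0 | 1.0 | True |
| 2 | 70031 | 70237 | 1.0 |  | 1.0 |  | 1 | 1.0 | 1.0 | 1.0 | 1.0 | True |
| 3 | 84795 | 97439 | 1.0 |  | 1.0 |  | 1 | 1.0 | 1.0 | 1.0 | 1.0 | True |
| 4 | 36950 | 42116 | 1.0 |  | 1.0 | 1.0 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | True |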


In [40]:
scores_pandas.shape

(574913, 12)

## References

```{bibliography}
:style: unsrt
:filter: docname in docnames
```