# Introduction to the DataFrame API

In this section, we will introduce the [DataFrame and Dataset APIs](https://spark.apache.org/docs/latest/sql-programming-guide.html).

We will use a small subset of the [Record Linkage Comparison Data Set](https://archive.ics.uci.edu/ml/datasets/record+linkage+comparison+patterns) from the UC Irvine Machine Learning Repository. It consists of several CSV files with match scores for patients in a German hospital, but we will use only one of them for the sake of simplicity. Please consult {cite:p}`schmidtmann2009evaluation` and {cite:p}`sariyar2011controlling` for more details on the data set and the related research.

## Setup
- Set up a `SparkSession` to work with the Dataset and DataFrame API.
- Unzip the `scores.zip` file located under the `data` folder.

In [1]:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("intro-to-df").setMaster("local")
sc = SparkContext(conf=conf)
# Avoid polluting the console with warning messages
sc.setLogLevel("ERROR")



Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/03/07 17:03:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Create a SparkSession to work with the DataFrame API

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession(sc)

In [3]:
help(SparkSession)

Help on class SparkSession in module pyspark.sql.session:

class SparkSession(pyspark.sql.pandas.conversion.SparkConversionMixin)
 |  SparkSession(sparkContext, jsparkSession=None)
 |  
 |  The entry point to programming Spark with the Dataset and DataFrame API.
 |  
 |  A SparkSession can be used create :class:`DataFrame`, register :class:`DataFrame` as
 |  tables, execute SQL over tables, cache tables, and read parquet files.
 |  To create a :class:`SparkSession`, use the following builder pattern:
 |  
 |  .. autoattribute:: builder
 |     :annotation:
 |  
 |  Examples
 |  --------
 |  >>> spark = SparkSession.builder \
 |  ...     .master("local") \
 |  ...     .appName("Word Count") \
 |  ...     .config("spark.some.config.option", "some-value") \
 |  ...     .getOrCreate()
 |  
 |  >>> from datetime import datetime
 |  >>> from pyspark.sql import Row
 |  >>> spark = SparkSession(sc)
 |  >>> allTypes = sc.parallelize([Row(i=1, s="string", d=1.0, l=1,
 |  ...     b=True, list=[1, 

### Unzip the scores file, if it was not done already

In [4]:
from os import path
scores_zip = path.join("data", "scores.zip")
scores_csv = path.join("data", "scores.csv")

%set_env SCORES_ZIP=$scores_zip
%set_env SCORES_CSV=$scores_csv

env: SCORES_ZIP=data/scores.zip
env: SCORES_CSV=data/scores.csv


In [5]:
%%bash
command -v unzip >/dev/null 2>&1 || { echo >&2 "unzip command is not installed. Aborting."; exit 1; }
[[ -f "$SCORES_CSV" ]] && { echo "file $SCORES_CSV already exists. Skipping."; exit 0; }

[[ -f "$SCORES_ZIP" ]] || { echo >&2 "file $SCORES_ZIP does not exist. Aborting."; exit 1; }

echo "Unzip file $SCORES_ZIP"
unzip "$SCORES_ZIP" -d data

Unzip file data/scores.zip


Archive:  data/scores.zip


  inflating: data/scores.csv         


  inflating: data/__MACOSX/._scores.csv  
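
On systems without the `unzip` command, the same "extract only if missing" step can be done from Python. The following is a minimal sketch using the standard `zipfile` module; the helper name `unzip_if_missing` is hypothetical, and it assumes the archive stores the CSV at its top level under the same file name as the target path.

```python
from pathlib import Path
from zipfile import ZipFile


def unzip_if_missing(zip_path, csv_path):
    """Extract the CSV named by csv_path from zip_path unless it already exists.

    Returns True if the file was extracted, False if it was already there.
    Assumes the archive member has the same name as csv_path's file name.
    """
    zip_path, csv_path = Path(zip_path), Path(csv_path)
    if csv_path.is_file():
        return False
    if not zip_path.is_file():
        raise FileNotFoundError(f"file {zip_path} does not exist")
    with ZipFile(zip_path) as archive:
        # Extract only the CSV member, skipping extras such as __MACOSX/ entries.
        archive.extract(csv_path.name, path=csv_path.parent)
    return True
```

In this notebook it would be called as `unzip_if_missing(scores_zip, scores_csv)`. Extracting a single member also avoids creating the `__MACOSX` folder seen in the output above.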


In [6]:
! head "$SCORES_CSV"

"id_1","id_2","cmp_fname_c1","cmp_fname_c2","cmp_lname_c1","cmp_lname_c2","cmp_sex","cmp_bd","cmp_bm","cmp_by","cmp_plz","is_match"
37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE
39086,47614,1,?,1,?,1,1,1,1,1,TRUE
70031,70237,1,?,1,?,1,1,1,1,1,TRUE
84795,97439,1,?,1,?,1,1,1,1,1,TRUE
36950,42116,1,?,1,1,1,1,1,1,1,TRUE
42413,48491,1,?,1,?,1,1,1,1,1,TRUE
25965,64753,1,?,1,?,1,1,1,1,1,TRUE
49451,90407,1,?,1,?,1,1,1,1,0,TRUE
39932,40902,1,?,1,?,1,1,1,1,1,TRUE


## Loading the Scores CSV file into a DataFrame

We are going to use the `DataFrameReader` API, which is available through `spark.read`.

In [7]:
help(spark.read)

Help on DataFrameReader in module pyspark.sql.readwriter object:

class DataFrameReader(OptionUtils)
 |  DataFrameReader(spark)
 |  
 |  Interface used to load a :class:`DataFrame` from external storage systems
 |  (e.g. file systems, key-value stores, etc). Use :attr:`SparkSession.read`
 |  to access this.
 |  
 |  .. versionadded:: 1.4
 |  
 |  Method resolution order:
 |      DataFrameReader
 |      OptionUtils
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, spark)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None, comment=None, header=None, inferSchema=None, ignoreLeadingWhiteSpace=None, ignoreTrailingWhiteSpace=None, nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None, maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None, columnNameOfCorruptRecord=None, multiLi

In [8]:
help(spark.read.csv)

Help on method csv in module pyspark.sql.readwriter:

csv(path, schema=None, sep=None, encoding=None, quote=None, escape=None, comment=None, header=None, inferSchema=None, ignoreLeadingWhiteSpace=None, ignoreTrailingWhiteSpace=None, nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None, maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None, columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None, samplingRatio=None, enforceSchema=None, emptyValue=None, locale=None, lineSep=None, pathGlobFilter=None, recursiveFileLookup=None, modifiedBefore=None, modifiedAfter=None, unescapedQuoteHandling=None) method of pyspark.sql.readwriter.DataFrameReader instance
    Loads a CSV file and returns the result as a  :class:`DataFrame`.
    
    This function will go through the input once to determine the input schema if
    ``inferSchema`` is enabled. To avoid going through the entire data once, di

In [9]:
scores = spark.read.csv(scores_csv)

In [10]:
scores

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string, _c6: string, _c7: string, _c8: string, _c9: string, _c10: string, _c11: string]

In [11]:
help(scores.show)

Help on method show in module pyspark.sql.dataframe:

show(n=20, truncate=True, vertical=False) method of pyspark.sql.dataframe.DataFrame instance
    Prints the first ``n`` rows to the console.
    
    .. versionadded:: 1.3.0
    
    Parameters
    ----------
    n : int, optional
        Number of rows to show.
    truncate : bool or int, optional
        If set to ``True``, truncate strings longer than 20 chars by default.
        If set to a number greater than one, truncates long strings to length ``truncate``
        and align cells right.
    vertical : bool, optional
        If set to ``True``, print output rows vertically (one line
        per column value).
    
    Examples
    --------
    >>> df
    DataFrame[age: int, name: string]
    >>> df.show()
    +---+-----+
    |age| name|
    +---+-----+
    |  2|Alice|
    |  5|  Bob|
    +---+-----+
    >>> df.show(truncate=3)
    +---+----+
    |age|name|
    +---+----+
    |  2| Ali|
    |  5| Bob|
    +---+----+
    >>> df

We can look at the head of the DataFrame by calling the `show` method.

scores.show()

**Can anyone spot what's wrong with the above data?**

- Question marks
- Column names
- `Float` and `Int` in the same column
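
The first issue can be quantified directly from the raw file, before involving Spark at all. The sketch below uses only the standard `csv` module to count how often the `?` marker appears in each column; the function name `count_marker` is hypothetical, and it assumes the first line of the file holds the column names, as in the `head` output shown earlier.

```python
import csv
from collections import Counter


def count_marker(csv_path, marker="?"):
    """Count, per column, how many cells are exactly equal to marker."""
    counts = Counter()
    with open(csv_path, newline="") as handle:
        # DictReader takes the column names from the first line of the file.
        for row in csv.DictReader(handle):
            for column, value in row.items():
                if value == marker:
                    counts[column] += 1
    return counts
```

Calling `count_marker(scores_csv)` returns a `Counter` keyed by column name. This reads the whole file in a single Python process, so it is only meant as a quick check on a small file like this one.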

Let's check the schema of our DataFrame.

In [12]:
help(scores.printSchema)

Help on method printSchema in module pyspark.sql.dataframe:

printSchema() method of pyspark.sql.dataframe.DataFrame instance
    Prints out the schema in the tree format.
    
    .. versionadded:: 1.3.0
    
    Examples
    --------
    >>> df.printSchema()
    root
     |-- age: integer (nullable = true)
     |-- name: string (nullable = true)
    <BLANKLINE>



In [13]:
scores.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)



**Why is everything a `String`?**

### Managing Schema and Null Values

In [14]:
scores_df = (
    spark.read
        .option("header", "true")
        .option("nullValue", "?")
        .option("inferSchema", "true")
        .csv(scores_csv)
)

[Stage 2:>                                                          (0 + 1) / 1]

                                                                                

In [15]:
scores_df.printSchema()

root
 |-- id_1: integer (nullable = true)
 |-- id_2: integer (nullable = true)
 |-- cmp_fname_c1: double (nullable = true)
 |-- cmp_fname_c2: double (nullable = true)
 |-- cmp_lname_c1: double (nullable = true)
 |-- cmp_lname_c2: double (nullable = true)
 |-- cmp_sex: integer (nullable = true)
 |-- cmp_bd: integer (nullable = true)
 |-- cmp_bm: integer (nullable = true)
 |-- cmp_by: integer (nullable = true)
 |-- cmp_plz: integer (nullable = true)
 |-- is_match: boolean (nullable = true)
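
Schema inference costs an extra pass over the input. Once the column types are known, they can be declared up front instead. The sketch below is one way to do that, not the notebook's original approach: it builds a DDL-formatted schema string from the inferred schema above and passes it to `DataFrameReader.schema`. The names `SCORE_COLUMNS`, `SCORES_DDL` and `read_scores` are hypothetical, and `spark` is assumed to be an existing `SparkSession`.

```python
# Column names and Spark SQL types, copied from the inferred schema.
SCORE_COLUMNS = [
    ("id_1", "INT"),
    ("id_2", "INT"),
    ("cmp_fname_c1", "DOUBLE"),
    ("cmp_fname_c2", "DOUBLE"),
    ("cmp_lname_c1", "DOUBLE"),
    ("cmp_lname_c2", "DOUBLE"),
    ("cmp_sex", "INT"),
    ("cmp_bd", "INT"),
    ("cmp_bm", "INT"),
    ("cmp_by", "INT"),
    ("cmp_plz", "INT"),
    ("is_match", "BOOLEAN"),
]

# DDL-formatted schema string, e.g. "id_1 INT, id_2 INT, ..."
SCORES_DDL = ", ".join(f"{name} {dtype}" for name, dtype in SCORE_COLUMNS)


def read_scores(spark, csv_path):
    """Read the scores CSV with an explicit schema instead of inferSchema."""
    return (
        spark.read
            .option("header", "true")   # first line holds column names
            .option("nullValue", "?")   # treat "?" as a missing value
            .schema(SCORES_DDL)
            .csv(csv_path)
    )
```

With the session created earlier, `read_scores(spark, scores_csv).printSchema()` should list the same column names and types as the inferred schema.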



In [16]:
scores_df.show(5)

+-----+-----+-----------------+------------+------------+------------+-------+------+------+------+-------+--------+
| id_1| id_2|     cmp_fname_c1|cmp_fname_c2|cmp_lname_c1|cmp_lname_c2|cmp_sex|cmp_bd|cmp_bm|cmp_by|cmp_plz|is_match|
+-----+-----+-----------------+------------+------------+------------+-------+------+------+------+-------+--------+
|37291|53113|0.833333333333333|        null|         1.0|        null|      1|     1|     1|     1|      0|    true|
|39086|47614|              1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|70031|70237|              1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|84795|97439|              1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|36950|42116|              1.0|        null|         1.0|         1.0|      1|     1|     1|     1|      1|    true|
+-----+-----+-----------------+------------+------------+-------

## References

```{bibliography}
:style: unsrt
:filter: docname in docnames
```