
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# DataFrames and Transformations Review
## De-Duping Data Lab

In this exercise, we're doing ETL on a file we've received from a customer. That file contains data about people, including:

* first, middle and last names
* gender
* birth date
* Social Security number
* salary

But, as is unfortunately common in data we get from this customer, the file contains some duplicate records. Worse:

* In some of the records, the names are mixed case (e.g., "Carol"), while in others, they are uppercase (e.g., "CAROL").
* The Social Security numbers aren't consistent either. Some of them are hyphenated (e.g., "992-83-4829"), while others are missing hyphens ("992834829").

If all of the name fields match -- if you disregard character case -- then the birth dates and salaries are guaranteed to match as well,
and the Social Security Numbers *would* match if they were somehow put in the same format.

Your job is to remove the duplicate records. The specific requirements of your job are:

* Remove duplicates. It doesn't matter which record you keep; it only matters that you keep one of them.
* Preserve the data format of the columns. For example, if you write the first name column in all lowercase, you haven't met this requirement.

<img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> The initial dataset contains 103,000 records.
The de-duplicated result has 100,000 records.
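
The two requirements above (drop duplicates, keep the original formatting) can be met together by comparing records on a normalized key while keeping one of the original records untouched. The following is a minimal plain-Python sketch of that idea, not the lab's reference solution; the helper names `dedupe_key` and `dedupe` and the middle and last names of the "Carol" records are made up for illustration.

```python
import re

def dedupe_key(rec):
    """Build a comparison key: lowercased names plus digits-only SSN."""
    return (
        rec["firstName"].lower(),
        rec["middleName"].lower(),
        rec["lastName"].lower(),
        re.sub(r"\D", "", rec["ssn"]),  # drop hyphens and any other non-digit
    )

def dedupe(records):
    """Keep the first record seen for each key; the kept record is not modified."""
    kept = {}
    for rec in records:
        kept.setdefault(dedupe_key(rec), rec)
    return list(kept.values())

# Hypothetical sample rows: the first two differ only in case and SSN hyphenation.
people = [
    {"firstName": "Carol", "middleName": "Ann", "lastName": "Smith", "ssn": "992-83-4829"},
    {"firstName": "CAROL", "middleName": "ANN", "lastName": "SMITH", "ssn": "992834829"},
    {"firstName": "Emanuel", "middleName": "Wallace", "lastName": "Panton", "ssn": "935-90-7627"},
]

unique_people = dedupe(people)
print(len(unique_people))
print(unique_people[0]["firstName"], unique_people[0]["ssn"])
```

In a DataFrame, the same idea is usually expressed by adding temporary normalized columns, de-duplicating on them, and then dropping them, so that the surviving rows keep their original values.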

Next, write the results in **Delta** format as a **single data file** to the directory given by the variable *deltaDestDir*.

<img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> Remember the relationship between the number of partitions in a DataFrame and the number of files written.

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output" target="_blank">DataFrameReader</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html?#functions" target="_blank">Built-In Functions</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output" target="_blank">DataFrameWriter</a>

In [0]:
%run ./Includes/Classroom-Setup

In [0]:
spark.catalog.clearCache()

It's helpful to look at the file first, so you can check the format. `dbutils.fs.head()` (or just `%fs head`) is a big help here.

In [0]:
%fs head dbfs:/mnt/training/dataframes/people-with-dups.txt

In [0]:
# TODO

sourceFile = "dbfs:/mnt/training/dataframes/people-with-dups.txt"
destFile = workingDir + "people.parquet/"

# In case it already exists
dbutils.fs.rm(destFile, True)

# Complete your work here...


Out[33]: False

In [0]:
destFile

Out[34]: 'dbfs:/user/hamed.vaheb@pwc.lu/dbacademy/spark_programming/asp_3_4_review/people.parquet/'

In [0]:
BronzeDF = (spark.read
  .format("csv")
  .option("delimiter", ":")
  .option("header", "true")
  .load(sourceFile)
)
display(BronzeDF)

firstName,middleName,lastName,gender,birthDate,salary,ssn
Emanuel,Wallace,Panton,M,1988-03-04,101255,935-90-7627
Eloisa,Rubye,Cayouette,F,2000-06-20,204031,935-89-9009
Cathi,Svetlana,Prins,F,2012-12-22,35895,959-30-7957
Mitchel,Andres,Mozdzierz,M,1966-05-06,55108,989-27-8093
Angla,Melba,Hartzheim,F,1938-07-26,13199,935-27-4276
Rachel,Marlin,Borremans,F,1923-02-23,67070,996-41-8616
Catarina,Phylicia,Dominic,F,1969-09-29,201021,999-84-8888
Antione,Randy,Hamacher,M,2004-03-05,271486,917-96-3554
Madaline,Shawanda,Piszczek,F,1996-03-17,183944,963-87-9974
Luciano,Norbert,Sarcone,M,1962-12-14,73069,909-96-1669


In [0]:
BronzeDF.count()

Out[36]: 103000

In [0]:
BronzeDF.rdd.getNumPartitions()

Out[37]: 2

In [0]:
BronzeDF = BronzeDF.coalesce(1)

In [0]:
BronzeDF.rdd.getNumPartitions()

Out[39]: 1

In [0]:
import re
from functools import reduce
from typing import List  # type hint for list arguments

from pyspark.sql import DataFrame  # type hint for Spark dataframes
from pyspark.sql.functions import col, regexp_replace, regexp_extract, split, when, concat_ws, udf, size, expr, trim, lit, element_at, to_date, date_format, lower, length, to_timestamp, monotonically_increasing_id, initcap
from pyspark.sql.types import ArrayType, StringType, BooleanType, IntegerType, LongType, TimestampType, DateType
#from unidecode import unidecode



# COMMAND ----------

# MAGIC %md
# MAGIC ## Define Python UDFs

# COMMAND ----------

# MAGIC %md
# MAGIC In this notebook, native Spark functions are preferred over Python user-defined functions (UDFs), because native Spark functions are more efficient.
# MAGIC UDFs are avoided wherever possible. The few that are used are defined below:

# COMMAND ----------


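# Used in the primary cleaning function ASCIItoUnicode.
# NOTE: this UDF needs `from unidecode import unidecode`, which is commented out in the imports;
# without that import, calling the UDF fails with a NameError at execution time.
remove_accents_udf = udf(lambda x: unidecode(x) if x is not None else x)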

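def is_all_upper(input_str):
    # Spark passes None for null values; re.match would raise a TypeError on None
    if input_str is None:
        return None
    pattern = r'^[A-Z ]+$'
    return bool(re.match(pattern, input_str))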

udf_is_all_upper = udf(is_all_upper, BooleanType())


# Used in the function clean_first_last_name for handling special case of articles
articles = ['van', 'Van', 'VAN', 'von', 'VON', 'Von', 'de la', 'De La', 'DE LA', 'Der', 'der', 'DER', 'de', 'De', 'DE', 'du', 'DU', 'Du', 'DA', 'da', 'Da', 'Di', 'DI', 'di']
def article_occur_index(lst):
    if lst is not None:
        set1 = set(lst)
        set2 = set(articles)
        intersection = set1.intersection(set2)
        if not intersection:
            return None
        for i in range(len(lst)-1, -1, -1):
            if lst[i] in intersection:
                return i
        return None
    else:
        return None

      
article_occur_index_udf = udf(article_occur_index, IntegerType())    

# Register the Python UDF as a Spark SQL function so that it is recognized inside `expr`
spark.udf.register("article_occur_index_udf", article_occur_index_udf)        


# COMMAND ----------

# MAGIC %md
# MAGIC ## Primary Cleaning Functions on Columns

# COMMAND ----------

# MAGIC %md
# MAGIC To avoid repeating the explanation of similar operations, and to provide a unified framework, some recurring operations are defined once and given names. The documentation of later functions in this notebook refers to them by those names:

# COMMAND ----------

def lower_column_names(sdf: DataFrame) -> DataFrame:
    """
    Converts the column names of the input dataframe to lowercase.

        Parameters:
            sdf (pyspark.sql.dataframe.DataFrame): Spark Dataframe.

        Returns:
            res (pyspark.sql.dataframe.DataFrame): The input spark dataframe `sdf` but with renamed columns.
    """
    
    res = sdf
    cols = res.schema.names
    cols_lower = [c.lower() for c in cols]
    res = reduce(lambda res, idx: res.withColumnRenamed(cols[idx], cols_lower[idx]), range(len(cols)), res)
    return res

# COMMAND ----------

def ASCIItoUnicode(sdf: DataFrame, colname: str, cond_col: str = None, cond: str = "y") -> DataFrame:
    """
    Uses the UDF `remove_accents_udf` to convert the input column of the input spark dataframe to ASCII format. The `remove_accents_udf` UDF utilizes Python's `unidecode` library.
    The quality of resulting ASCII formats obtained by `unidecode` function should be between good and perfect for languages of western origin.
    Therefore, since this is the case for Golden Record sources, the results are satisfactory.

        Parameters:
            sdf (pyspark.sql.dataframe.DataFrame): Spark Dataframe.
            colname (str): Name of column to be modified.
            cond_col (str): Name of column to be conditioned on. If no value is given, no conditioning will occur. Default value is "None".
            cond (str): Whether the condition should be on the trueness of `cond_col` ("y") or falseness of `cond_col` (any other value). Default value is "y".

        Returns:
            res (pyspark.sql.dataframe.DataFrame): The input spark dataframe `sdf` but with modified column `colname`. This column is converted to ASCII format.
    """

    res = sdf
    if cond_col == None:
        res = res.withColumn(colname, remove_accents_udf(colname))
    else:
        if cond_col in res.columns:
            if cond.lower() != "y":
                res = res.withColumn(
                    colname,
                    when(~col(cond_col), remove_accents_udf(colname)).otherwise(
                        col(colname)
                    ),
                )
            else:
                res = res.withColumn(
                    colname,
                    when(col(cond_col), remove_accents_udf(colname)).otherwise(
                        col(colname)
                    ),
                )
    return res

# COMMAND ----------

def RemoveWhitespace(sdf: DataFrame, colname: str, cond_col: str = None, cond: str = "y") -> DataFrame:
    """
    Applies two operations on the input column of the input spark dataframe. First, `regexp_replace(colname, r"\s{2,}", " ")` collapses every run of two or more consecutive whitespace characters into a single space.
    Second, the trim function removes spaces at the beginning and end of the strings of the column.

        Parameters:
            sdf (pyspark.sql.dataframe.DataFrame): Spark Dataframe.
            colname (str): Name of column to be modified.
            cond_col (str): Name of column to be conditioned on. If no value is given, no conditioning will occur. Default value is "None".
            cond (str): Whether the condition should be on the trueness of `cond_col` ("y") or falseness of `cond_col` (any other value). Default value is "y".

        Returns:
            res (pyspark.sql.dataframe.DataFrame): The input spark dataframe `sdf` but with modified column `colname`. Whitespaces are removed from this column.
    """

    res = sdf
    if cond_col == None:
        res = res.withColumn(colname, regexp_replace(colname, r"\s{2,}", " "))
        res = res.withColumn(colname, trim(col(colname)))
    else:
        if cond_col in res.columns:
            if cond.lower() != "y":
                res = res.withColumn(
                    colname,
                    when(
                        ~col(cond_col), regexp_replace(colname, r"\s{2,}", " ")
                    ).otherwise(col(colname)),
                )
                res = res.withColumn(
                    colname,
                    when(~col(cond_col), trim(col(colname))).otherwise(col(colname)),
                )
            else:
                res = res.withColumn(
                    colname,
                    when(
                        col(cond_col), regexp_replace(colname, r"\s{2,}", " ")
                    ).otherwise(col(colname)),
                )
                res = res.withColumn(
                    colname,
                    when(col(cond_col), trim(col(colname))).otherwise(col(colname)),
                )
    return res

# COMMAND ----------

def RemoveUnwantedChars(sdf: DataFrame, colname: str, unwanted_char_pattern: str, cond_col: str = None, cond: str = "y") -> DataFrame:
    """
    Removes unexpected characters from the input column of the input spark dataframe using `regexp_replace(colname, unwanted_char_pattern, "")`.
    In `unwanted_char_pattern`, the characters to keep are usually listed instead of the characters to remove: they are placed in a character class that begins with `^`, which negates the class,
    and thus everything other than these characters is removed. Which characters are expected (and hence kept) and which are removed depends on the column under study.
    For instance, for cleaning names, more precisely the columns `firstname` and `lastname`, the pattern `[^a-zA-Z\s]` is used, meaning that only letters (`a-zA-Z`) and whitespace (`\s`) are allowed.
    On the other hand, for the column `DateOfBirthDateOfBirth`, which contains birthdates, the pattern `[^a-zA-Z\d\s\d\\\\\/]` is used, as some birthdates contain the symbols `\` or `/`, so these symbols are kept.

        Parameters:
            sdf (pyspark.sql.dataframe.DataFrame): Spark Dataframe.
            colname (str): Name of column to be modified.
            unwanted_char_pattern (str): Regex pattern of unwanted characters. The function will remove any substring of the input column that matches this pattern.
            cond_col (str): Name of column to be conditioned on. If no value is given, no conditioning will occur. Default value is "None".
            cond (str): Whether the condition should be on the trueness of `cond_col` ("y") or falseness of `cond_col` (any other value). Default value is "y".

        Returns:
            res (pyspark.sql.dataframe.DataFrame): The input spark dataframe `sdf` but with modified column `colname`. Unwanted (unexpected) characters are removed from this column based on user's regex pattern containing what to remove.
    """

    res = sdf
    if cond_col == None:
        res = res.withColumn(
            colname, regexp_replace(colname, unwanted_char_pattern, "")
        )
    else:
        if cond_col in res.columns:
            if cond.lower() != "y":
                res = res.withColumn(
                    colname,
                    when(
                        ~col(cond_col),
                        regexp_replace(colname, unwanted_char_pattern, ""),
                    ).otherwise(col(colname)),
                )
            else:
                res = res.withColumn(
                    colname,
                    when(
                        col(cond_col),
                        regexp_replace(colname, unwanted_char_pattern, ""),
                    ).otherwise(col(colname)),
                )
    return res

# COMMAND ----------

def HandleNullVals(sdf: DataFrame, colname: str, cond_col: str = None, cond: str = "y") -> DataFrame:
    """
    Replaces values of the input column of the input spark dataframe that imply missing data with null values. The values that imply missing data differ between columns, but so that one function can serve all columns, all the following values are replaced with null:
     `unknown`, `null`, `na`, `""`, `" "`. The lowercase version of the string under study is compared with these values, so differences in character case play no role.
     Usually `RemoveWhitespace` should be applied before `HandleNullVals` so that all empty strings are properly replaced with null values.

        Parameters:
            sdf (pyspark.sql.dataframe.DataFrame): Spark Dataframe.
            colname (str): Name of column to be modified.
            cond_col (str): Name of column to be conditioned on. If no value is given, no conditioning will occur. Default value is "None".
            cond (str): Whether the condition should be on the trueness of `cond_col` ("y") or falseness of `cond_col` (any other value). Default value is "y".

        Returns:
            res (pyspark.sql.dataframe.DataFrame): The input spark dataframe `sdf` but with modified column `colname`. Any value in column implying missing value is now replaced with syntactic null.
    """

    res = sdf
    if cond_col == None:
        res = res.withColumn(
            colname,
            when(
                (lower(col(colname)) == "unknown")
                | (lower(col(colname)) == "null")
                | (lower(col(colname)) == "na")
                | (col(colname) == "")
                | (col(colname) == " "),
                lit(None),
            ).otherwise(col(colname)),
        )
    else:
        if cond_col in res.columns:
            if cond.lower() != "y":
                res = res.withColumn(
                    colname,
                    when(
                        # `&` binds tighter than `|`, so the alternatives are grouped explicitly
                        ~col(cond_col)
                        & (
                            (lower(col(colname)) == "unknown")
                            | (lower(col(colname)) == "null")
                            | (lower(col(colname)) == "na")
                            | (col(colname) == "")
                            | (col(colname) == " ")
                        ),
                        lit(None),
                    ).otherwise(col(colname)),
                )
            else:
                res = res.withColumn(
                    colname,
                    when(
                        # `&` binds tighter than `|`, so the alternatives are grouped explicitly
                        col(cond_col)
                        & (
                            (lower(col(colname)) == "unknown")
                            | (lower(col(colname)) == "null")
                            | (lower(col(colname)) == "na")
                            | (col(colname) == "")
                            | (col(colname) == " ")
                        ),
                        lit(None),
                    ).otherwise(col(colname)),
                )
    return res

# COMMAND ----------

def ConvertToLower(sdf: DataFrame, colname: str, cond_col: str = None, cond: str = "y") -> DataFrame:
    """
    Converts all characters of the input column of the input spark dataframe to lowercase.

        Parameters:
            sdf (pyspark.sql.dataframe.DataFrame): Spark Dataframe.
            colname (str): Name of column to be modified.
            cond_col (str): Name of column to be conditioned on. If no value is given, no conditioning will occur. Default value is "None".
            cond (str): Whether the condition should be on the trueness of `cond_col` ("y") or falseness of `cond_col` (any other value). Default value is "y".

        Returns:
            res (pyspark.sql.dataframe.DataFrame): The input spark dataframe `sdf` but with modified column `colname`. This column's contents are converted to lowercase version.
    """

    res = sdf
    if cond_col == None:
        res = res.withColumn(colname, lower(col(colname)))
    else:
        if cond_col in res.columns:
            if cond.lower() != "y":
                res = res.withColumn(
                    colname,
                    when(~col(cond_col), lower(col(colname))).otherwise(col(colname)),
                )
            else:
                res = res.withColumn(
                    colname,
                    when(col(cond_col), lower(col(colname))).otherwise(col(colname)),
                )
    return res

# COMMAND ----------

def ParseDate(sdf: DataFrame, colname: str, cond_col: str = None, cond: str = "y", DateFormat: str = "dd/MM/yyyy") -> DataFrame:
    """
    Parses the input column (should contain dates) based on the input date format.

        Parameters:
            sdf (pyspark.sql.dataframe.DataFrame): Spark Dataframe.
            colname (str): Name of column to be modified.
            cond_col (str): Name of column to be conditioned on. If no value is given, no conditioning will occur. Default value is "None".
            cond (str): Whether the condition should be on the trueness of `cond_col` ("y") or falseness of `cond_col` (any other value). Default value is "y".
            DateFormat (str): Format of date inherent in the input column. Default value is "dd/MM/yyyy".

        Returns:
            res (pyspark.sql.dataframe.DataFrame): The input spark dataframe `sdf` but with modified column `colname`. This column is now parsed to the user's date format.
    """

    res = sdf
    if cond_col == None:
        res = res.withColumn(colname, to_date(colname, DateFormat))
    else:
        if cond_col in res.columns:
            if cond.lower() != "y":
                res = res.withColumn(
                    colname,
                    when(~col(cond_col), to_date(colname, DateFormat)).otherwise(
                        col(colname)
                    ),
                )
            else:
                res = res.withColumn(
                    colname,
                    when(col(cond_col), to_date(colname, DateFormat)).otherwise(
                        col(colname)
                    ),
                )
    return res

# COMMAND ----------

def ConvertDate(sdf: DataFrame, colname: str, cond_col: str = None, cond: str = "y", DateFormat: str ="yyyy-dd-MM") -> DataFrame:
    """
    Converts the date format of input column (should contain dates and be of type date) to a desired input date format.

        Parameters:
            sdf (pyspark.sql.dataframe.DataFrame): Spark Dataframe.
            colname (str): Name of column to be modified.
            cond_col (str): Name of column to be conditioned on. If no value is given, no conditioning will occur. Default value is "None".
            cond (str): Whether the condition should be on the trueness of `cond_col` ("y") or falseness of `cond_col` (any other value). Default value is "y".
            DateFormat (str): Format of date to convert to. Default value is "yyyy-dd-MM".

        Returns:
            res (pyspark.sql.dataframe.DataFrame): The input spark dataframe `sdf` but with modified column `colname`. This column is now converted to the user's date format.
    """

    res = sdf
    if cond_col == None:
        res = res.withColumn(colname, date_format(colname, DateFormat))
    else:
        if cond_col in res.columns:
            if cond.lower() != "y":
                res = res.withColumn(
                    colname,
                    when(~col(cond_col), date_format(colname, DateFormat)).otherwise(
                        col(colname)
                    ),
                )
            else:
                res = res.withColumn(
                    colname,
                    when(col(cond_col), date_format(colname, DateFormat)).otherwise(
                        col(colname)
                    ),
                )
    return res

# COMMAND ----------

def ParseTime(sdf: DataFrame, colname: str, cond_col: str = None, cond: str = "y", TimeFormat: str = "yyyy-MM-dd HH:mm:ss") -> DataFrame:
    """
    Parses the input column (should contain date and time) based on the input date and time format.

        Parameters:
            sdf (pyspark.sql.dataframe.DataFrame): Spark Dataframe.
            colname (str): Name of column to be modified.
            cond_col (str): Name of column to be conditioned on. If no value is given, no conditioning will occur. Default value is "None".
            cond (str): Whether the condition should be on the trueness of `cond_col` ("y") or falseness of `cond_col` (any other value). Default value is "y".
            TimeFormat (str): Format of date and time inherent in the input column, written with Spark datetime pattern letters. Default value is "yyyy-MM-dd HH:mm:ss".

        Returns:
            res (pyspark.sql.dataframe.DataFrame): The input spark dataframe `sdf` but with modified column `colname`. This column is now parsed to the user's date and time format.
    """

    res = sdf
    if cond_col == None:
        res = res.withColumn(colname, to_timestamp(colname, TimeFormat))
    else:
        if cond_col in res.columns:
            if cond.lower() != "y":
                res = res.withColumn(
                    colname,
                    when(~col(cond_col), to_timestamp(colname, TimeFormat)).otherwise(
                        col(colname)
                    ),
                )
            else:
                res = res.withColumn(
                    colname,
                    when(col(cond_col), to_timestamp(colname, TimeFormat)).otherwise(
                        col(colname)
                    ),
                )
    return res


In [0]:
def clean_names_prepare(sdf: DataFrame, colname: str) -> DataFrame:
    """
    Cleans the input column (contains names). The allowed characters for names are letters and whitespace. The operations done on the name columns are akin to those used in the notebook `Resa-CleanFirstLastName`.
        Target Sources: 
            `rbe`, `rcsl`, `crm`, `salesforce`.

        Parameters:
            sdf (pyspark.sql.dataframe.DataFrame): Spark Dataframe.
            colname (str): Name of column to be modified (contains names).
            
        Returns:
            res (pyspark.sql.dataframe.DataFrame): The input spark dataframe `sdf` but with cleaned column `colname` 
            (contains names). 
   """

    res = sdf
    #res = ASCIItoUnicode(sdf = res, colname = colname)
    unwanted_char_pattern = r"[^a-zA-Z\s]"  # raw string, so `\s` reaches the regex engine unchanged
    res = RemoveUnwantedChars(sdf = res, colname = colname, unwanted_char_pattern = unwanted_char_pattern)
    res = RemoveWhitespace(sdf = res, colname = colname)
    res = HandleNullVals(sdf = res, colname = colname)
    res = res.withColumn(colname, initcap(colname))
    return res

# COMMAND ----------

def clean_names(sdf: DataFrame, colnames: List[str] = ["firstName", "lastName", "middleName"]) -> DataFrame: 
    """
    Cleans the name columns (by default `firstName`, `lastName`, and `middleName`) of a given spark dataframe using the function `clean_names_prepare`, then drops the rows that are duplicates on those columns.

        Target Sources: 
            `rbe`, `rcsl`, `crm`, `salesforce`.

        Parameters:
            sdf (pyspark.sql.dataframe.DataFrame): Spark Dataframe.
            colnames (str list): List of the names of the columns to be modified (contain first, middle, and last names). These columns are also the ones compared when dropping duplicates.


        Returns:
            res (pyspark.sql.dataframe.DataFrame): The input spark dataframe `sdf` but with the columns containing names cleaned.
    """ 
    
    res = sdf
    # Clean only the requested columns that actually exist in the dataframe
    name_cols = [c for c in res.columns if c in colnames]
    for colname in name_cols:
        res = clean_names_prepare(sdf=res, colname=colname)
    # Rows whose cleaned names all match are treated as duplicates
    res = res.dropDuplicates(colnames)
    return res


def clean_social_nums(sdf: DataFrame, colname: str) -> DataFrame:
    """
    Cleans the input column (contains Social Security numbers). Every character that is not a digit, such as a hyphen, is removed so that only digits remain.
    Values that imply missing data are then replaced with `null`, and any remaining value
    with length less than 4 is also replaced with `null`.

        Parameters:
            sdf (pyspark.sql.dataframe.DataFrame): Spark Dataframe.
            colname (str): Name of column to be modified (contains Social Security numbers).

        Returns:
            res (pyspark.sql.dataframe.DataFrame): The input spark dataframe `sdf` but with cleaned column `colname` 
            (contains Social Security numbers). 
   """

    res = sdf
    unwanted_char_pattern = r"[^\d]"  # raw string; matches every non-digit character
    res = RemoveUnwantedChars(sdf = res, colname = colname, unwanted_char_pattern = unwanted_char_pattern)

    res = HandleNullVals(sdf = res, colname = colname)
    res = res.withColumn(colname, when( (col(colname).isNotNull()) & (length(col(colname)) < 4), lit(None)).otherwise(col(colname)))  
    return res
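
To see what the two patterns used above do to individual values, they can be tried with Python's `re` module. This is only an approximation of the Spark behavior: `regexp_replace` runs on Java's regex engine, although these simple character classes behave the same way in both. The helper names and the sample name string are made up for illustration.

```python
import re

NAME_UNWANTED = r"[^a-zA-Z\s]"  # everything except letters and whitespace
SSN_UNWANTED = r"[^\d]"         # everything except digits

def clean_name(value):
    """Mimic RemoveUnwantedChars followed by RemoveWhitespace for one string."""
    value = re.sub(NAME_UNWANTED, "", value)
    value = re.sub(r"\s{2,}", " ", value)  # collapse runs of whitespace
    return value.strip()

def clean_ssn(value):
    """Mimic RemoveUnwantedChars with the digits-only pattern."""
    return re.sub(SSN_UNWANTED, "", value)

print(clean_name("  Mary-Ann   O'Neil "))
print(clean_ssn("992-83-4829"))
print(clean_ssn("992-83-4829") == clean_ssn("992834829"))
```

Note that the name pattern deletes hyphens and apostrophes rather than replacing them with a space, so the parts of a hyphenated name are joined together.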
    


In [0]:
cleaned_names = clean_names(BronzeDF)

In [0]:
cleaned_names.count()

Out[43]: 100000

In [0]:
cleaned_socialNums = clean_social_nums(sdf = cleaned_names, colname = "ssn")
display(cleaned_socialNums)

firstName,middleName,lastName,gender,birthDate,salary,ssn
,Chong,Agnelli,F,1974-01-10,22269,938595511
,Rosalinda,Alimento,F,1947-12-04,109918,947516902
,Tyisha,Baudoin,F,1981-09-07,166351,923289271
,Raisa,Biamonte,F,1959-01-11,138759,959921341
,Rufina,Dahlstrom,F,1918-01-14,167014,915414485
,Yajaira,Hartsch,F,1969-10-06,106148,976638345
,Stella,Kranawetter,F,1922-06-15,243015,976506588
,Machelle,Mahaffey,F,1958-01-26,88751,998784171
,Faye,Maravilla,F,1917-08-21,33679,961848209
,Johanna,Puma,F,2002-10-12,252315,948437416


In [0]:
silverDF = cleaned_socialNums

In [0]:
(silverDF
 .write
 .format("delta")
 .mode("overwrite")
 .save(destFile+"delta")
)

In [0]:
destFile

Out[47]: 'dbfs:/user/hamed.vaheb@pwc.lu/dbacademy/spark_programming/asp_3_4_review/people.parquet/'

In [0]:
destFile

Out[48]: 'dbfs:/user/hamed.vaheb@pwc.lu/dbacademy/spark_programming/asp_3_4_review/people.parquet/'

In [0]:
deltaDestDir = destFile+"delta"

In [0]:
display(dbutils.fs.ls(deltaDestDir))

path,name,size,modificationTime
dbfs:/user/hamed.vaheb@pwc.lu/dbacademy/spark_programming/asp_3_4_review/people.parquet/delta/_delta_log/,_delta_log/,0,1689439042000
dbfs:/user/hamed.vaheb@pwc.lu/dbacademy/spark_programming/asp_3_4_review/people.parquet/delta/part-00000-85aad260-924f-4f4d-9e88-cb01e7793b83-c000.snappy.parquet,part-00000-85aad260-924f-4f4d-9e88-cb01e7793b83-c000.snappy.parquet,2818937,1689439045000


In [0]:
deltaDestDir

Out[51]: 'dbfs:/user/hamed.vaheb@pwc.lu/dbacademy/spark_programming/asp_3_4_review/people.parquet/delta'

**CHECK YOUR WORK**

In [0]:
verify_files = dbutils.fs.ls(deltaDestDir)
verify_delta_format = False
verify_num_data_files = 0
for f in verify_files:
    if f.name == '_delta_log/':
        verify_delta_format = True
    elif f.name.endswith('.parquet'):
        verify_num_data_files += 1

assert verify_delta_format, "Data not written in Delta format"
assert verify_num_data_files == 1, "Expected 1 data file written"

verify_record_count = spark.read.format("delta").load(deltaDestDir).count()
assert verify_record_count == 100000, "Expected 100000 records in final result"

del verify_files, verify_delta_format, verify_num_data_files, verify_record_count

## Clean up classroom
Run the cell below to clean up resources.

In [0]:
%run "./Includes/Classroom-Cleanup"

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>