# Part IV: Data Schemas in Spark

This notebook covers some technical aspects of how Spark handles schemas for given data objects.

We will look at:
1. Inferred schemas, which are automatically discovered by Spark
2. Explicit schemas, which we provide to Spark

**Data Scenario**
For this notebook we won't be using the data we gathered from yfinance. Instead, this scenario has us looking at client data. Customer data is important to every business and Spark is an excellent tool to analyze client data that may be very large.

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import Row
from pyspark.sql.functions import regexp_replace, split

In [0]:
spark = SparkSession.builder.appName("Spark Schemas").getOrCreate()

## Using Databricks

Databricks makes it really easy to move external data (like a CSV) into the Databricks environment.

You can use the GUI to set all the properties needed and even get a preview of the data before you create a new table. Below we will be using PySpark and setting the read properties in our notebook.

If you are running your own Spark environment, you will load the data from the local directory.

In [0]:
# File location and type
file_location = "/FileStore/tables/MOCK_CLIENT_DATA.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

df.show(10)

+---+------------+---------+--------------------+---------------+--------------+---------------+---------------+---------------+
| id|  first_name|last_name|               email|account_balance|average_return|portfolio_theme|trade_frequency|last_contact_dt|
+---+------------+---------+--------------------+---------------+--------------+---------------+---------------+---------------+
|  1|        Bria|     Ruse|     bruse0@lulu.com|     $558243.81|         -4.53|           core|         Weekly|     2022-11-25|
|  2|       Merry|   Effemy| meffemy1@paypal.com|       $3739.01|          8.09|     smart-beta|         Seldom|     2022-07-20|
|  3|     Chrisse| Iannello|ciannello2@tinypi...|     $347550.67|         -3.58|     smart-beta|          Often|     2022-12-16|
|  4|Lorettalorna| Bonnette|lbonnette3@dropbo...|     $250191.19|          4.37|           core|         Yearly|     2022-05-12|
|  5|     Maxwell| Spellicy|mspellicy4@freewe...|     $705098.95|         -0.22|    sustainable| 

In [0]:
# Here we can view the inferred schema
df.schema

Out[32]: StructType([StructField('id', IntegerType(), True), StructField('first_name', StringType(), True), StructField('last_name', StringType(), True), StructField('email', StringType(), True), StructField('account_balance', StringType(), True), StructField('average_return', DoubleType(), True), StructField('portfolio_theme', StringType(), True), StructField('trade_frequency', StringType(), True), StructField('last_contact_dt', DateType(), True), StructField('numeric_balance', DoubleType(), True)])

You'll notice we let Spark infer the schema of our data. It determined, for example, that `id` is an integer and `last_contact_dt` is a date. (The `numeric_balance` field listed in the output above is not in the source file; it is a column we create later in this notebook.) You may also notice that `account_balance` is a string. Why is this?

Spark typed this column as a string because the values contain the "$" special character. Even though we would treat this data as a decimal number, a value like "$558243.81" cannot be parsed as a number, so schema inference falls back to the string type.
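To make that fallback concrete, here is a minimal plain-Python sketch of the idea behind type inference. It is not Spark's actual implementation: `infer_type` and the type names it returns are hypothetical, and Spark's real inference covers more types and options. The sketch simply tries stricter parsers first and falls back to string when none of them accepts every value.

```python
from datetime import date


def infer_type(values):
    """Return the first type name whose parser accepts every value (hypothetical helper)."""

    def all_parse(parser):
        # A parser "fits" only if it accepts every value in the column.
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    if all_parse(date.fromisoformat):
        return "date"
    # Nothing stricter fits, so fall back to string.
    return "string"


print(infer_type(["1", "2"]))                    # integer
print(infer_type(["-4.53", "8.09"]))             # double
print(infer_type(["2022-11-25", "2022-07-20"]))  # date
print(infer_type(["$558243.81", "$3739.01"]))    # string
```

The last call mirrors `account_balance`: the leading "$" makes every numeric parser fail, so the column ends up as a string.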

## Changing Data Types
Because we know we will be performing numerical calculations on the account_balance column, let's go ahead and create a new column that is the data type and format we need. 

First we will have to remove the "$" special character.
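
One detail to keep in mind: "$" has a special meaning in regular expressions (it anchors a match to the end of the input), so it must be escaped to match a literal dollar sign. The plain-Python sketch below uses the standard `re` module to show the difference. Spark's `regexp_replace` uses Java regular expressions, where the same escaping applies; the sample value is taken from the table above.

```python
import re

raw_balance = "$558243.81"

# An unescaped "$" matches only the empty position at the end of the string,
# so nothing visible is removed.
unescaped = re.sub("$", "", raw_balance)

# An escaped "\$" (written as a raw string) matches the literal dollar sign.
escaped = re.sub(r"\$", "", raw_balance)

print(unescaped)       # $558243.81
print(escaped)         # 558243.81
print(float(escaped))  # 558243.81
```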

**Maintaining Data Quality** <br>
Before we start changing this data, we should consider the impact. By removing the special character, what information could be lost?

To preserve the integrity of the data, we can create another new column that holds the currency type whenever we remove a currency sign. In this case, when we remove a "$" we would record "USD" in the currency type column.

Since we are creating a new column for the numeric balance anyway and preserving the original account_balance column, this isn't a requirement, but it is good to keep in mind.
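
As a rough illustration of that idea, the sketch below shows the row-level logic in plain Python so it can run without a cluster. The helper `split_balance` and the `SYMBOL_TO_CURRENCY` mapping are hypothetical names introduced here, and the mapping of "$" to "USD" is the assumption stated above. In Spark the same logic would be expressed with column expressions rather than a Python function.

```python
from decimal import Decimal

# Hypothetical mapping from currency symbol to currency code.
SYMBOL_TO_CURRENCY = {"$": "USD"}


def split_balance(raw_balance):
    """Split a string like "$558243.81" into (currency_code, amount)."""
    currency = SYMBOL_TO_CURRENCY.get(raw_balance[0])
    if currency is None:
        # No recognised symbol: keep the whole string as the amount
        # and leave the currency unknown.
        return None, Decimal(raw_balance)
    return currency, Decimal(raw_balance[1:])


print(split_balance("$558243.81"))  # ('USD', Decimal('558243.81'))
print(split_balance("3739.01"))     # (None, Decimal('3739.01'))
```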

In [0]:
# Use a raw string so that "\$" reaches the regex engine as a literal dollar sign
df = df.withColumn(
    "numeric_balance", regexp_replace("account_balance", r"\$", "").cast("double")
)
df.show(5)

+---+------------+---------+--------------------+---------------+--------------+---------------+---------------+---------------+---------------+
| id|  first_name|last_name|               email|account_balance|average_return|portfolio_theme|trade_frequency|last_contact_dt|numeric_balance|
+---+------------+---------+--------------------+---------------+--------------+---------------+---------------+---------------+---------------+
|  1|        Bria|     Ruse|     bruse0@lulu.com|     $558243.81|         -4.53|           core|         Weekly|     2022-11-25|      558243.81|
|  2|       Merry|   Effemy| meffemy1@paypal.com|       $3739.01|          8.09|     smart-beta|         Seldom|     2022-07-20|        3739.01|
|  3|     Chrisse| Iannello|ciannello2@tinypi...|     $347550.67|         -3.58|     smart-beta|          Often|     2022-12-16|      347550.67|
|  4|Lorettalorna| Bonnette|lbonnette3@dropbo...|     $250191.19|          4.37|           core|         Yearly|     2022-05-12|  

Nice, we now have our data ready for analysis. This was pretty easy since Spark inferred the data types in our file fairly accurately. But what if we want more control over the data types? For example, in monetary calculations it is almost always best to use an exact decimal type such as Java's BigDecimal, which Spark exposes as DecimalType. This is because precision is a top concern when dealing with money, and binary floating-point types like double cannot represent most decimal fractions exactly.

The downside to using BigDecimal is that it can be slower.
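
The precision concern is easy to reproduce in plain Python, whose `float` is a binary double and whose standard `decimal.Decimal` plays a role similar to BigDecimal. This is only an illustration of the general floating-point issue, not of Spark itself.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_total = 0.1 + 0.2

# Decimal stores the digits exactly as written.
decimal_total = Decimal("0.10") + Decimal("0.20")

print(float_total)                       # 0.30000000000000004
print(float_total == 0.3)                # False
print(decimal_total)                     # 0.30
print(decimal_total == Decimal("0.30"))  # True
```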

## Explicit Schemas

When defining a schema explicitly, we must provide the column name, the data type, and whether the column is nullable.

In [0]:
# Import our data types
from pyspark.sql.types import DecimalType, StructType, StructField, StringType, IntegerType, DateType, FloatType

In [0]:
# define our schema
fields = [
    # Column name, Data Type, Is Nullable
    StructField("id", IntegerType(), True),
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("account_balance", StringType(), True),
    StructField("average_return", FloatType(), True),
    StructField("portfolio_theme", StringType(), True),
    StructField("trade_frequency", StringType(), True),
    StructField("last_contact_dt", DateType(), True),
    StructField("numeric_balance", DecimalType(), True),
]

explicit_schema = StructType(fields)

In [0]:
# File location and type
file_location = "/FileStore/tables/MOCK_CLIENT_DATA.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = (
    spark.read.format(file_type)
    .option("inferSchema", infer_schema)
    .option("header", first_row_is_header)
    .option("sep", delimiter)
    .schema(explicit_schema)
    .load(file_location)
)

df.show(10)

+---+------------+---------+--------------------+---------------+--------------+---------------+---------------+---------------+---------------+
| id|  first_name|last_name|               email|account_balance|average_return|portfolio_theme|trade_frequency|last_contact_dt|numeric_balance|
+---+------------+---------+--------------------+---------------+--------------+---------------+---------------+---------------+---------------+
|  1|        Bria|     Ruse|     bruse0@lulu.com|     $558243.81|         -4.53|           core|         Weekly|     2022-11-25|           null|
|  2|       Merry|   Effemy| meffemy1@paypal.com|       $3739.01|          8.09|     smart-beta|         Seldom|     2022-07-20|           null|
|  3|     Chrisse| Iannello|ciannello2@tinypi...|     $347550.67|         -3.58|     smart-beta|          Often|     2022-12-16|           null|
|  4|Lorettalorna| Bonnette|lbonnette3@dropbo...|     $250191.19|          4.37|           core|         Yearly|     2022-05-12|  

Hmmm, now why is numeric_balance null? Remember, the numeric_balance column doesn't exist in our source data. We had to create it ourselves.

Spark took our word for it that the data contained that column, but because no data actually existed for it, Spark filled the column with nulls. Let's go ahead and create our numeric_balance column again, this time setting the data type to DecimalType.
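
A loose plain-Python analogy may help here. The standard `csv.DictReader` behaves in a comparable way when it is given more column names than the rows contain: the missing field is filled with `None`. The column list and sample rows below are a shortened, illustrative subset of our data, and this is an analogy only, not a description of Spark's internals.

```python
import csv

# Our "schema" names one more column than the file actually has.
schema_columns = ["id", "first_name", "account_balance", "numeric_balance"]

csv_lines = [
    "id,first_name,account_balance",
    "1,Bria,$558243.81",
    "2,Merry,$3739.01",
]

# Skip the header row and apply our own column names instead.
reader = csv.DictReader(csv_lines[1:], fieldnames=schema_columns)
rows = list(reader)

print(rows[0]["account_balance"])  # $558243.81
print(rows[0]["numeric_balance"])  # None
```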

In [0]:
df = df.withColumn(
    "numeric_balance",
    regexp_replace("account_balance", r"\$", "").cast(DecimalType(18, 2)),
)
df.schema

Out[57]: StructType([StructField('id', IntegerType(), True), StructField('first_name', StringType(), True), StructField('last_name', StringType(), True), StructField('email', StringType(), True), StructField('account_balance', StringType(), True), StructField('average_return', FloatType(), True), StructField('portfolio_theme', StringType(), True), StructField('trade_frequency', StringType(), True), StructField('last_contact_dt', DateType(), True), StructField('numeric_balance', DecimalType(18,2), True)])
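
In `DecimalType(18, 2)`, the first argument is the precision (at most 18 digits in total) and the second is the scale (2 digits after the decimal point). The plain-Python sketch below shows what a scale of 2 means using the standard `decimal` module. The helper `to_scale_2` is hypothetical, and the half-up rounding mode is an assumption made for this illustration; check how your Spark version rounds before relying on it.

```python
from decimal import Decimal, ROUND_HALF_UP

# Two digits after the decimal point, like the scale in DecimalType(18, 2).
SCALE_2 = Decimal("0.01")


def to_scale_2(text):
    """Parse a numeric string and fix it to two decimal places (hypothetical helper)."""
    return Decimal(text).quantize(SCALE_2, rounding=ROUND_HALF_UP)


print(to_scale_2("558243.81"))  # 558243.81
print(to_scale_2("250191.1"))   # 250191.10
print(to_scale_2("3739.015"))   # 3739.02
```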

In [0]:
df.show(5)

+---+------------+---------+--------------------+---------------+--------------+---------------+---------------+---------------+---------------+
| id|  first_name|last_name|               email|account_balance|average_return|portfolio_theme|trade_frequency|last_contact_dt|numeric_balance|
+---+------------+---------+--------------------+---------------+--------------+---------------+---------------+---------------+---------------+
|  1|        Bria|     Ruse|     bruse0@lulu.com|     $558243.81|         -4.53|           core|         Weekly|     2022-11-25|      558243.81|
|  2|       Merry|   Effemy| meffemy1@paypal.com|       $3739.01|          8.09|     smart-beta|         Seldom|     2022-07-20|        3739.01|
|  3|     Chrisse| Iannello|ciannello2@tinypi...|     $347550.67|         -3.58|     smart-beta|          Often|     2022-12-16|      347550.67|
|  4|Lorettalorna| Bonnette|lbonnette3@dropbo...|     $250191.19|          4.37|           core|         Yearly|     2022-05-12|  

And that's it! Now we know how to work with schemas and data types in Spark.

You can read more about Spark's data types in the documentation here -> [Spark.apache.org/docs/data_types](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/data_types.html)