### Library Imports

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import types as T

from pyspark.sql import functions as F

from datetime import datetime
from decimal import Decimal

### Template

In [2]:
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 1.4 - Decimals and Why did my Decimals overflow")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext

def get_csv_schema(*args):
    return T.StructType([
        T.StructField(*arg)
        for arg in args
    ])

def read_csv(fname, schema):
    return spark.read.csv(
        path=fname,
        header=True,
        schema=get_csv_schema(*schema)
    )

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path

### Decimals and Why did my Decimals overflow

You will typically deal with `Decimal` types when talking about money, height, weight, and other quantities where exact fractional values matter. Working with `Decimal` types may appear simple at first, but there are some nuances that will sneak up on you. Because they are hard to debug, we will go through some ways to get around them.

Here is some simple jargon that we will use in the following examples:
* `Integer`: The set of whole numbers and their opposites (the positive whole numbers, the negative whole numbers, and zero), e.g. -1, 0, 1, 2, etc.
* `Decimal Number`: A number with a fractional part, written with digits to the right of the decimal point, e.g. 2.1, 10.5, etc.
* `Precision`: The maximum total number of digits.
* `Scale`: The number of digits to the right of the decimal point.

Sources:
* [SparkNotes: integers and rationals terms](https://www.sparknotes.com/math/prealgebra/integersandrationals/terms/)
* [PySpark `DecimalType` documentation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.types.DecimalType)
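
To make `precision` and `scale` concrete, here is a minimal sketch that reads both off a Python `Decimal`. The helper name `precision_and_scale` is made up for this illustration (it is not part of Python or Spark), and it assumes a finite value.

```python
from decimal import Decimal


def precision_and_scale(value):
    """Return (precision, scale) for a finite Decimal (hypothetical helper)."""
    _sign, digits, exponent = value.as_tuple()
    if exponent >= 0:
        # No fractional digits, e.g. Decimal("100") or Decimal("1E+2").
        return len(digits) + exponent, 0
    scale = -exponent
    # Values such as 0.05 store a single digit but still need 2 digits of scale.
    return max(len(digits), scale), scale


print(precision_and_scale(Decimal("20.2")))    # (3, 1)
print(precision_and_scale(Decimal("113.79")))  # (5, 2)
print(precision_and_scale(Decimal("100")))     # (3, 0)
```

So `113.79` needs at least a precision of 5 and a scale of 2 to be stored without losing digits.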

### Case 1: Working With `Decimal`s in Python

In [3]:
print("Example 1 - {}".format(Decimal(20)))
print("Example 2 - {}".format(Decimal("20.2")))
print("Example 3 - {}".format(Decimal(20.5)))
print("Example 4 - {}".format(Decimal(20.2)))

Example 1 - 20
Example 2 - 20.2
Example 3 - 20.5
Example 4 - 20.199999999999999289457264239899814128875732421875


**What Happened?**

Let's break down the examples above.

Example 1:

Here we provided a whole number, so nothing special.

Example 2:

Here we provided a string representing a decimal number. The `precision` and `scale` were preserved.

Example 3:

Here we provided a float. The `precision` and `scale` were preserved, because `20.5` happens to have an exact binary floating point representation.

Example 4:

Here we also provided a float, but the result isn't what we expected. This is because `20.2` has no exact representation in binary floating point, so Python stores the closest value it can, and `Decimal` faithfully reproduces every digit of that approximation. If you want to know more about this you can look up "IEEE 754 floating point representation". We will keep this in mind for later on.
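
A quick way to see the difference is to compare the float-built and string-built values directly. This is only a small sketch using the standard `decimal` module; the variable names are illustrative.

```python
from decimal import Decimal

from_float = Decimal(20.2)     # built from the binary approximation of 20.2
from_string = Decimal("20.2")  # built from exactly the digits we typed

print(from_float == from_string)         # False
print(Decimal(20.5) == Decimal("20.5"))  # True, 20.5 is exact in binary
print(Decimal(str(20.2)))                # 20.2
```

Going through `str` first works because Python prints the shortest string that round-trips to the same float, which here is `"20.2"`.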

### Case 2: Reading in Decimals in Spark (Incorrectly)

In [4]:
pets = read_csv(
    fname=path,
    schema=[
        ("id", T.LongType(), False),
        ("breed_id", T.LongType(), True),
        ("nickname", T.StringType(), True),
        ("birthday", T.TimestampType(), True),
        ("age", T.LongType(), True),
        ("color", T.StringType(), True),
        ("weight", T.DecimalType(), True),
    ]
)
pets.show()

+---+--------+--------+-------------------+---+-----+------+
| id|breed_id|nickname|           birthday|age|color|weight|
+---+--------+--------+-------------------+---+-----+------+
|  1|       1|    King|2014-11-22 12:30:31|  5|brown|    10|
|  2|       3|   Argus|2016-11-22 10:05:10| 10| null|     6|
|  3|       1|  Chewie|2016-11-22 10:05:10| 15| null|    12|
|  3|       2|   Maple|2018-11-22 10:05:10| 17|white|     3|
|  4|       2|    null|2019-01-01 10:05:10| 13| null|    10|
+---+--------+--------+-------------------+---+-----+------+



**What Happened?**

What happened to the digits after the decimal point? They weren't read in; the weights were rounded to whole numbers. This is because the default arguments of `T.DecimalType()` are `DecimalType(precision=10, scale=0)`. So to read in the data correctly we need to override these default arguments.
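
The effect of the `scale` can be imitated in plain Python with `Decimal.quantize`. This is a sketch of the behaviour, not Spark's own code: the helper `apply_scale` is hypothetical, the weights are the ones from `pets.csv`, and half-up rounding is an assumption that happens to be consistent with the output above.

```python
from decimal import Decimal, ROUND_HALF_UP


def apply_scale(value, scale):
    """Round `value` to `scale` fractional digits (hypothetical helper)."""
    quantum = Decimal(1).scaleb(-scale)  # scale=2 -> Decimal("0.01")
    return value.quantize(quantum, rounding=ROUND_HALF_UP)


weights = [Decimal(w) for w in ("10.00", "5.50", "12.00", "3.40", "10.00")]

print([str(apply_scale(w, 0)) for w in weights])  # ['10', '6', '12', '3', '10']
print([str(apply_scale(w, 2)) for w in weights])  # ['10.00', '5.50', '12.00', '3.40', '10.00']
```

With a scale of 0 the fractional digits are gone for good, which is why the schema has to declare the scale the data actually needs.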

### Case 2: Reading in Decimals in Spark (Correctly)

In [5]:
pets = read_csv(
    fname=path,
    schema=[
        ("id", T.LongType(), False),
        ("breed_id", T.LongType(), True),
        ("nickname", T.StringType(), True),
        ("birthday", T.TimestampType(), True),
        ("age", T.LongType(), True),
        ("color", T.StringType(), True),
        ("weight", T.DecimalType(10,2), True),
    ]
)
pets.show()

+---+--------+--------+-------------------+---+-----+------+
| id|breed_id|nickname|           birthday|age|color|weight|
+---+--------+--------+-------------------+---+-----+------+
|  1|       1|    King|2014-11-22 12:30:31|  5|brown| 10.00|
|  2|       3|   Argus|2016-11-22 10:05:10| 10| null|  5.50|
|  3|       1|  Chewie|2016-11-22 10:05:10| 15| null| 12.00|
|  3|       2|   Maple|2018-11-22 10:05:10| 17|white|  3.40|
|  4|       2|    null|2019-01-01 10:05:10| 13| null| 10.00|
+---+--------+--------+-------------------+---+-----+------+



### Case 3: Reading in Large Decimals in Spark

In [6]:
spark.createDataFrame(
    data=[
        (100,),
        (2 ** 63,)
    ],
    schema=['data']
).show()

+----+
|data|
+----+
| 100|
|null|
+----+



**What Happened?**

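Why is the second value null? Spark infers `LongType` (a signed 64-bit integer) for Python `int` values, and `2 ** 63` is one larger than the maximum value that type can hold (`2 ** 63 - 1`). The value overflows during the conversion and never makes it to Spark (Scala).

If you see unexpected `null`s like this, check your input data, as there might be something wrong there.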
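
Since `2 ** 63` is one past the largest signed 64-bit integer, a simple range check on the input can flag such values before they reach Spark. The helper below is a hypothetical sketch and only covers plain Python integers.

```python
# Bounds of a signed 64-bit integer.
LONG_MIN = -(2 ** 63)
LONG_MAX = 2 ** 63 - 1


def out_of_long_range(values):
    """Return the integers that do not fit in a signed 64-bit long."""
    return [v for v in values if not LONG_MIN <= v <= LONG_MAX]


print(out_of_long_range([100, 2 ** 63]))  # [9223372036854775808]
```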

### Case 3: Setting Values in a DataFrame (Incorrectly)

In [7]:
(
    pets
    .withColumn('decimal_column', F.lit(Decimal(20.2)))
    .show()
)

AnalysisException: u'DecimalType can only support precision up to 38;'

**What Happened?**

Remember our Python examples above? `Decimal(20.2)` expands to a value with 50 digits, but the maximum `precision` of the Spark `T.DecimalType` is 38 digits, so the value does not fit in the Spark type.
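
You can count the digits yourself before handing a value to Spark. A minimal sketch, where `digit_count` and `SPARK_MAX_PRECISION` are names invented for this example:

```python
from decimal import Decimal

SPARK_MAX_PRECISION = 38  # the limit quoted in the error message above


def digit_count(value):
    """Number of digits stored in a finite Decimal (hypothetical helper)."""
    return len(value.as_tuple().digits)


print(digit_count(Decimal(20.2)))                           # 50
print(digit_count(Decimal("20.2")))                         # 3
print(digit_count(Decimal(20.2)) <= SPARK_MAX_PRECISION)    # False
print(digit_count(Decimal("20.2")) <= SPARK_MAX_PRECISION)  # True
```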

### Case 3: Setting Values in a DataFrame (Correctly)

In [8]:
(
    pets
    .withColumn('decimal_column', F.lit(Decimal("20.2")))
    .show()
)

+---+--------+--------+-------------------+---+-----+------+--------------+
| id|breed_id|nickname|           birthday|age|color|weight|decimal_column|
+---+--------+--------+-------------------+---+-----+------+--------------+
|  1|       1|    King|2014-11-22 12:30:31|  5|brown| 10.00|          20.2|
|  2|       3|   Argus|2016-11-22 10:05:10| 10| null|  5.50|          20.2|
|  3|       1|  Chewie|2016-11-22 10:05:10| 15| null| 12.00|          20.2|
|  3|       2|   Maple|2018-11-22 10:05:10| 17|white|  3.40|          20.2|
|  4|       2|    null|2019-01-01 10:05:10| 13| null| 10.00|          20.2|
+---+--------+--------+-------------------+---+-----+------+--------------+



**What Happened?**

If we provide the decimal number as a string, `Decimal` stores exactly the digits we typed, and the value fits comfortably within Spark's 38 digits of `precision`.
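
When the values already exist as floats, one option is to route them through a small conversion helper before calling `F.lit`. The function `to_decimal` below is a hypothetical sketch, not a Spark or Python API; it goes through `str` and optionally rounds half up to a fixed number of places.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_decimal(value, places=None):
    """Build a Decimal from a float, int, or string without float noise."""
    result = value if isinstance(value, Decimal) else Decimal(str(value))
    if places is not None:
        # Decimal(1).scaleb(-2) is Decimal("0.01"), i.e. two fractional digits.
        result = result.quantize(Decimal(1).scaleb(-places), rounding=ROUND_HALF_UP)
    return result


print(to_decimal(20.2))            # 20.2
print(to_decimal(20.2, places=2))  # 20.20
print(to_decimal("113.79"))        # 113.79
```

The result can then be passed on as `F.lit(to_decimal(20.2))` in the same way as the string-built value above.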

### Case 4: Performing Arithmetic with `DecimalType`s (Incorrectly)

In [9]:
pets = spark.createDataFrame(
    data=[
        (Decimal('113.790000000000000000'), Decimal('2.54')),
        (Decimal('113.790000000000000000'), Decimal('2.54')),
    ],
    schema=['weight_in_kgs','conversion_to_lbs']
)

pets.show()

+--------------------+--------------------+
|       weight_in_kgs|   conversion_to_lbs|
+--------------------+--------------------+
|113.7900000000000...|2.540000000000000000|
|113.7900000000000...|2.540000000000000000|
+--------------------+--------------------+



In [10]:
(
    pets
    .withColumn('weight_in_lbs', F.col('weight_in_kgs') * F.col('conversion_to_lbs'))
    .show()
)

+--------------------+--------------------+-------------+
|       weight_in_kgs|   conversion_to_lbs|weight_in_lbs|
+--------------------+--------------------+-------------+
|113.7900000000000...|2.540000000000000000|   289.026600|
|113.7900000000000...|2.540000000000000000|   289.026600|
+--------------------+--------------------+-------------+



**What Happened?**

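This used to overflow and return `null`... Guess they updated it 😅. Spark infers `DecimalType(38,18)` for Python `Decimal` values, so an exact product would need far more than the 38 digits Spark allows. Newer versions of Spark reduce the `scale` of the result instead (here to 6 digits, hence `289.026600`), so the calculation no longer overflows but can silently lose digits.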

### Case 4: Performing Arithmetic with `DecimalType`s (Correctly)

In [11]:
(
    pets
    .withColumn('weight_in_kgs', F.col('weight_in_kgs').cast('Decimal(20,2)'))
    .withColumn('conversion_to_lbs', F.col('conversion_to_lbs').cast('Decimal(20,2)'))
    .withColumn('weight_in_lbs', F.col('weight_in_kgs') * F.col('conversion_to_lbs'))
    .show()
)

+-------------+-----------------+-------------+
|weight_in_kgs|conversion_to_lbs|weight_in_lbs|
+-------------+-----------------+-------------+
|       113.79|             2.54|     289.0266|
|       113.79|             2.54|     289.0266|
+-------------+-----------------+-------------+



**What Happened?**

Before doing the calculations, we reduced all of the values to at most 2 digits of `scale` (with the help of the `cast` function, which we will learn about in the later chapters). This is how you should perform your arithmetic operations with `DecimalType`s. Ideally you should know the minimum `scale` needed for each column.
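
The scales in the two results (`289.026600` and `289.0266`) can be reproduced with a little arithmetic. The sketch below follows the multiplication rule as I understand it from Spark's SQL documentation: the exact result type is `precision = p1 + p2 + 1` and `scale = s1 + s2`, and when that exceeds 38 digits the scale is cut back, but not below 6 digits (or the original scale, if smaller). The function name and constants are my own, it assumes Spark inferred `DecimalType(38,18)` for the Python `Decimal` inputs, and the real logic lives inside Spark and may differ between versions.

```python
MAX_PRECISION = 38      # largest precision a Spark DecimalType supports
MIN_ADJUSTED_SCALE = 6  # assumed floor for the scale when digits must be dropped


def multiply_result_type(p1, s1, p2, s2):
    """Approximate (precision, scale) of decimal(p1,s1) * decimal(p2,s2)."""
    precision = p1 + p2 + 1
    scale = s1 + s2
    if precision <= MAX_PRECISION:
        return precision, scale
    # Too wide: keep the integer digits and give up fractional digits first.
    int_digits = precision - scale
    min_scale = min(scale, MIN_ADJUSTED_SCALE)
    adjusted_scale = max(MAX_PRECISION - int_digits, min_scale)
    return MAX_PRECISION, adjusted_scale


print(multiply_result_type(38, 18, 38, 18))  # (38, 6)
print(multiply_result_type(20, 2, 20, 2))    # (38, 4)
print(multiply_result_type(10, 2, 10, 2))    # (21, 4)
```

With the inferred `(38,18)` columns the result keeps only 6 fractional digits, while the columns cast to `Decimal(20,2)` keep all 4 digits of the exact product.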

### Summary

* We learned that you should always initialize `Decimal` types using string representations of numbers if they have a fractional part.
* When reading in `Decimal` types, you should explicitly override the default arguments of the Spark type and make sure that the underlying data is correct.
* When performing arithmetic operations with decimal types, you should first reduce the `scale` of each operand to the smallest number of digits you actually need, if you haven't already.