# Spark Learning Note - Data Types & Data Sources

## API Locations

**Column Methods**
- `pyspark.sql.functions`

**DataFrame methods**: A DataFrame is just a Dataset of Row objects, so many useful methods are documented under `Dataset` (in PySpark, `pyspark.sql.DataFrame`):
- class `DataFrameStatFunctions` (accessed as `df.stat`) for statistical methods
- class `DataFrameNaFunctions` (accessed as `df.na`) for handling null data

In [1]:
# check java version 
# use sudo update-alternatives --config java to switch java version if needed.
!java -version

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~19.10-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)


In [2]:
from pyspark.sql import SparkSession
data_example_path = '/home/jgeng/Documents/Git/SparkLearning/data/retail.csv'

In [17]:
# build a spark session locally
spark = SparkSession.builder.appName('Spark Example').getOrCreate()

# show the session summary (the number of local workers can be set with .master('local[N]') when building)
spark

In [18]:
df = spark.read.format('csv').option('header', True).option('inferSchema', True).load(data_example_path)

In [19]:
df.show(5)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 5 rows



## 1. Boolean / Filtering

Booleans are mostly used for filtering, and sometimes we need to create a boolean column to filter the data.

Some tools for boolean selection/filtering (`instr` comes from `pyspark.sql.functions`; `isin` is a Column method).
Each of them returns a Column object, i.e. an expression that is evaluated per row.
Not every Column is boolean, and `filter` and `where` only accept boolean conditions, so a non-boolean Column must first be converted (for example with a comparison).
- `df.col_name.isin(s1, s2)` or `df.col_name.isin([s1, s2])`: returns a boolean Column that can be used directly for filtering.
- `instr(col, s)`: returns the 1-based position of the first occurrence of the substring `s` (0 if it does not occur); compare the result with a number to get a boolean for filtering.
- when filter expressions are stored in variables, combine them with `&` and `|` instead of Python's `and` and `or`
- **Be careful about null data when creating booleans**. Use `.eqNullSafe()` (more later)
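
The `instr` bullet above relies on 1-based positions, with 0 meaning "not found". As a rough plain-Python analogue (no Spark needed; `instr_like` is a hypothetical helper written for illustration, not a PySpark function), `str.find` plus one gives the same numbers, and a comparison turns the position into the boolean a filter needs:

```python
def instr_like(text, sub):
    """Mimic the 1-based position returned by SQL-style instr (0 if absent)."""
    # str.find is 0-based and returns -1 when the substring is missing,
    # so adding 1 maps it onto the instr convention.
    return text.find(sub) + 1

descriptions = ["DOTCOM POSTAGE", "POSTAGE", "WHITE METAL LANTERN"]
positions = [instr_like(d, "POSTAGE") for d in descriptions]
print(positions)

# Comparing the position with a number yields a boolean per row.
contains = [p > 0 for p in positions]
starts_with = [p == 1 for p in positions]
print(contains, starts_with)
```

The three strings are `Description` values that appear in this notebook's outputs; the positions line up with the `== 8`, `== 1` and `== 0` filters used in the cell that follows.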


In [20]:
# filtering by value in a specific column
df.where(df.StockCode.isin(['DOT', '71053'])).show()
df.where(df.StockCode.isin('DOT', '71053'))  # equivalent

+---------+---------+-------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|        Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+-------------------+--------+-------------------+---------+----------+--------------+
|   536365|    71053|WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536373|    71053|WHITE METAL LANTERN|       6|2010-12-01 09:02:00|     3.39|   17850.0|United Kingdom|
|   536375|    71053|WHITE METAL LANTERN|       6|2010-12-01 09:32:00|     3.39|   17850.0|United Kingdom|
|   536396|    71053|WHITE METAL LANTERN|       6|2010-12-01 10:51:00|     3.39|   17850.0|United Kingdom|
|   536406|    71053|WHITE METAL LANTERN|       8|2010-12-01 11:33:00|     3.39|   17850.0|United Kingdom|
|   536544|    71053|WHITE METAL LANTERN|       1|2010-12-01 14:32:00|     8.47|      null|United Kingdom|
|   536544|      DOT|     DOTCOM POST

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

In [21]:
from pyspark.sql.functions import instr

# filter by whether a substring occurs in a column
df.where(instr(df.Description, 'POSTAGE')==8).show(2)  # POSTAGE first appears at position 8 in DOTCOM POSTAGE
df.where(instr(df.Description, 'POSTAGE')==1).show(2)  # rows whose Description starts with POSTAGE
df.where(instr(df.Description, 'POSTAGE')==0).show(2)  # rows whose Description does not contain POSTAGE

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      null|United Kingdom|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      null|United Kingdom|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+

+---------+---------+-----------+--------+-------------------+---------+----------+-----------+
|InvoiceNo|StockCode|Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|    Country|
+---------+---------+-----------+--------+-------------------+---------+----------+-----------+
|   536370|     POST|    POSTAGE|       3|2010-12-01 08:45:00|     18.0|   12583.0|     France|
|  

In [22]:
# combined filters
filter1 = instr(df.Description, 'POSTAGE')==0
filter2 = df.StockCode.isin(['DOT', '71053'])
df.where(filter1 | filter2).show(2)
df.where(filter1 & filter2).show(2)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 2 rows

+---------+---------+-------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|        Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+-------------------+--------+-------------------+---------+----------+--------------+
|   53

In [25]:
from pyspark.sql.functions import col

# we can also use withColumn() to do the filtering and selection together
# Adds a boolean column -> filter on the boolean column -> select
df.withColumn('isExpensive', col('UnitPrice')>200).where('isExpensive').select('StockCode', 'Quantity', 'UnitPrice').show(3)

+---------+--------+---------+
|StockCode|Quantity|UnitPrice|
+---------+--------+---------+
|      DOT|       1|   569.77|
|      DOT|       1|   607.49|
+---------+--------+---------+
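
The null caution in the list at the top of this section comes from SQL's three-valued logic: comparing anything with null gives null, and `where` drops rows whose condition is null. The sketch below imitates that in plain Python, with `None` standing in for null. `sql_eq` and `null_safe_eq` are hypothetical helpers written for illustration; only the second mirrors what `.eqNullSafe()` is meant to do.

```python
def sql_eq(a, b):
    """Ordinary SQL-style equality: null (None) in, null out."""
    if a is None or b is None:
        return None
    return a == b

def null_safe_eq(a, b):
    """Null-safe equality: always True or False, and null equals null."""
    if a is None or b is None:
        return a is None and b is None
    return a == b

customer_ids = [17850.0, None, 12583.0]

# A filter keeps a row only when its condition is exactly True,
# so a None (null) condition silently removes the row.
plain = [sql_eq(c, None) for c in customer_ids]
safe = [null_safe_eq(c, None) for c in customer_ids]
print(plain)  # every comparison with null is null
print(safe)   # only the missing CustomerID matches null
```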



## 2.1 Numbers

Often we need to transform columns with our own expressions. We can do this by taking a numeric column (`df.col_name`), applying regular arithmetic to it, and adding the result to the table (with `select` or `withColumn`).
- operations between two numerical columns are supported.
- `selectExpr(expressions)` or `select(expr())` can also do this
- use `round(col, digits)` to round numbers, or `ROUND` in an expression
- `round` rounds 2.5 to 3 (half up). `bround` rounds 2.5 to 2 (half to even)
- `corr` computes the correlation between two columns (with `selectExpr`).
    - the correlation between two columns is a single value, so don't `select` it together with other columns
    - use `df.stat.corr(col1, col2)` to get a scalar value

**A very handy method: `df.describe().show()` shows some basic statistics about the data.**
To get a specific metric, use the functions from `pyspark.sql.functions`: `count()`, `mean()`, `stddev_pop()`, `min()`, `max()`
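
The difference between `round` and `bround` in the list above is the tie-breaking rule. A plain-Python sketch with the standard `decimal` module (not Spark code; mapping `round` to half-up and `bround` to half-even follows the 2.5 example above, and both helper names are hypothetical) shows the two modes side by side:

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

def round_half_up(x):
    """Ties go away from zero, like `round` turning 2.5 into 3."""
    return int(Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

def round_half_even(x):
    """Ties go to the nearest even integer, like `bround` turning 2.5 into 2."""
    return int(Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))

# Values passed as strings so each tie is represented exactly.
ties = ["0.5", "1.5", "2.5", "3.5"]
half_up = [round_half_up(t) for t in ties]
half_even = [round_half_even(t) for t in ties]
print(half_up)
print(half_even)
```

Half-to-even avoids the small upward bias that half-up introduces when many ties are summed, which is the usual reason to prefer `bround` for aggregates.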

In [26]:
# modified price
correctPrice = pow((df.UnitPrice * 2 + 2), 2)

# DO NOT drop UnitPrice before the select:
# the column is still needed to compute correctPrice
df.select('*', correctPrice.alias('TrueUnitPrice')).drop('UnitPrice').show(2)  

+---------+---------+--------------------+--------+-------------------+----------+--------------+-----------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|CustomerID|       Country|    TrueUnitPrice|
+---------+---------+--------------------+--------+-------------------+----------+--------------+-----------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|   17850.0|United Kingdom|            50.41|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|   17850.0|United Kingdom|77.08840000000002|
+---------+---------+--------------------+--------+-------------------+----------+--------------+-----------------+
only showing top 2 rows



In [27]:
# multiplying two numerical columns is supported
totalPrice = df.Quantity * df.UnitPrice
df.select('StockCode', totalPrice.alias('TotalPrice')).show(2)

+---------+------------------+
|StockCode|        TotalPrice|
+---------+------------------+
|   85123A|15.299999999999999|
|    71053|             20.34|
+---------+------------------+
only showing top 2 rows
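
The `15.299999999999999` in the output above is ordinary binary floating-point behaviour rather than a Spark quirk. The same multiplication in plain Python gives a value that is not exactly 15.3, and rounding (or decimal arithmetic) recovers the expected number:

```python
from decimal import Decimal

total = 6 * 2.55          # Quantity * UnitPrice for the first row
print(total)              # not exactly 15.3, because 2.55 has no exact binary form
print(round(total, 2))    # rounding for display hides the artifact

# Decimal keeps base-10 digits exactly, which is why money is often stored that way.
exact = Decimal("2.55") * 6
print(exact)
```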



In [28]:
# the above can also be expressed with selectExpr (here with extra SQL functions and rounding)
df.selectExpr('StockCode', 'ROUND(SQRT(POWER(Quantity * UnitPrice, 2)), 2) as TotalPrice').show(2)

+---------+----------+
|StockCode|TotalPrice|
+---------+----------+
|   85123A|      15.3|
|    71053|     20.34|
+---------+----------+
only showing top 2 rows



In [29]:
# correlation between two columns
df.selectExpr('corr(Quantity, UnitPrice)').show()

# use corr from df.stat
df.stat.corr('Quantity', 'UnitPrice')

+-----------------------------------------+
|corr(CAST(Quantity AS DOUBLE), UnitPrice)|
+-----------------------------------------+
|                     -0.04112314436835551|
+-----------------------------------------+



-0.04112314436835551

In [30]:
# super handy method
df.select('CustomerID', 'Quantity', 'UnitPrice').describe().show()  # be careful about the mean: nulls are skipped (see the lower CustomerID count)

+-------+------------------+------------------+------------------+
|summary|        CustomerID|          Quantity|         UnitPrice|
+-------+------------------+------------------+------------------+
|  count|              1968|              3108|              3108|
|   mean|15661.388719512195| 8.627413127413128| 4.151946589446603|
| stddev|1854.4496996893627|26.371821677029203|15.638659854603892|
|    min|           12431.0|               -24|               0.0|
|    max|           18229.0|               600|            607.49|
+-------+------------------+------------------+------------------+



In [31]:
from pyspark.sql.functions import stddev_pop, mean

df.select(mean('UnitPrice'), stddev_pop('UnitPrice')).show()


+-----------------+---------------------+
|   avg(UnitPrice)|stddev_pop(UnitPrice)|
+-----------------+---------------------+
|4.151946589446603|   15.636143780280698|
+-----------------+---------------------+
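
`describe()` reported a slightly larger standard deviation for `UnitPrice` than `stddev_pop` did. That is consistent with `describe()` using the sample formula (divide by n - 1) and `stddev_pop` using the population formula (divide by n). A quick plain-Python check, using only the numbers printed above:

```python
import math

n = 3108                              # count of UnitPrice from describe()
stddev_pop = 15.636143780280698       # from stddev_pop('UnitPrice')
stddev_describe = 15.638659854603892  # stddev row from describe()

# sample variance = population variance * n / (n - 1)
stddev_sample = stddev_pop * math.sqrt(n / (n - 1))
print(stddev_sample)
print(math.isclose(stddev_sample, stddev_describe, rel_tol=1e-9))
```

For a few thousand rows the two differ only in the fourth significant digit, but the gap grows quickly for small groups.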



## 2.2 Some Other Useful functions

- `df.stat.crosstab(col1, col2)` returns a frequency table of paired column values.
- `df.stat.freqItems(cols)` finds frequent items for the given columns, possibly with false positives.
- `df.stat.approxQuantile(colname, [quantiles], relError)` returns approximate quantiles
- `monotonically_increasing_id` from `pyspark.sql.functions` adds an id column to a DataFrame (using `select`); the ids are unique and increasing, but not guaranteed to be consecutive
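
According to the Spark documentation, the current implementation of `monotonically_increasing_id` puts the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The plain-Python sketch below imitates that layout (`sketch_id` is a hypothetical helper, and the layout is an implementation detail that may change) to show why ids can jump between partitions:

```python
def sketch_id(partition_id, row_in_partition):
    """Imitate the documented layout: partition in the high bits, row in the low 33."""
    return (partition_id << 33) + row_in_partition

# Two hypothetical partitions with three rows each.
ids = [sketch_id(p, r) for p in range(2) for r in range(3)]
print(ids)
```

With a single partition the ids happen to be 0, 1, 2, and so on, as in a small local example; that should not be relied on for larger data.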

In [46]:
# cross tab
df.limit(10).stat.crosstab('CustomerID', 'UnitPrice').show()


+--------------------+----+----+----+----+----+----+----+
|CustomerID_UnitPrice|1.69|1.85|2.55|2.75|3.39|4.25|7.65|
+--------------------+----+----+----+----+----+----+----+
|             13047.0|   1|   0|   0|   0|   0|   0|   0|
|             17850.0|   0|   2|   1|   1|   3|   1|   1|
+--------------------+----+----+----+----+----+----+----+
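
What `crosstab` computes can be sketched in plain Python: count each (row value, column value) pair, then lay the counts out as a table. The pairs below are the `CustomerID` / `UnitPrice` values of the five rows printed by `df.show(5)` earlier, so the counts cover only those rows, not the ten used in the cell above.

```python
from collections import Counter

# (CustomerID, UnitPrice) for the five rows shown by df.show(5)
pairs = [
    (17850.0, 2.55),
    (17850.0, 3.39),
    (17850.0, 2.75),
    (17850.0, 3.39),
    (17850.0, 3.39),
]

counts = Counter(pairs)                       # frequency of each pair
row_keys = sorted({c for c, _ in pairs})      # distinct CustomerID values
col_keys = sorted({p for _, p in pairs})      # distinct UnitPrice values

# One output row per CustomerID, one column per UnitPrice; missing pairs count as 0.
table = {r: [counts[(r, c)] for c in col_keys] for r in row_keys}
print(col_keys)
print(table)
```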



In [65]:
# freq items
df.stat.freqItems(['CustomerID']).show()
print(df.where('CustomerID = 12662').count())
print(df.where('CustomerID = 12868').count())

+--------------------+
|CustomerID_freqItems|
+--------------------+
|[12662.0, 12868.0...|
+--------------------+

15
12


In [68]:
from pyspark.sql.functions import monotonically_increasing_id

# add id to a dataframe
df.select(monotonically_increasing_id().alias('id'), '*').show(5)

+---+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
| id|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|  0|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|  1|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|  2|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|  3|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|  4|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
+---+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+

In [80]:
quantiles = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  # must all be floats
relError = 0.01
df.select(mean('Quantity')).show()
df.selectExpr('min(Quantity)').show()
df.selectExpr('max(Quantity)').show()

# approximated quantiles

df.stat.approxQuantile('Quantity', quantiles, relError)

+-----------------+
|    avg(Quantity)|
+-----------------+
|8.627413127413128|
+-----------------+

+-------------+
|min(Quantity)|
+-------------+
|          -24|
+-------------+

+-------------+
|max(Quantity)|
+-------------+
|          600|
+-------------+



[-24.0, 1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 6.0, 10.0, 12.0, 600.0]
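
The `relError` argument bounds the rank of the returned value, not the value itself. Per the Spark documentation, for N rows and probability p the result x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). A small plain-Python sketch of that window follows; `rank_window` is a hypothetical helper, and N = 3108 is the `Quantity` count reported by `describe()` above.

```python
import math

def rank_window(p, n, rel_error):
    """Range of exact ranks an approximate quantile is allowed to land in."""
    low = math.floor((p - rel_error) * n)
    high = math.ceil((p + rel_error) * n)
    # ranks cannot fall outside the data
    return max(low, 0), min(high, n)

n = 3108
median_window = rank_window(0.5, n, 0.01)   # relError used in the cell above
exact_window = rank_window(0.5, n, 0.0)     # relError = 0 means exact
print(median_window)
print(exact_window)
```

Setting `relError` to 0 asks for exact quantiles, which can be very expensive on large data.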

## 3.1 Strings

## 3.2 Regular Expression

## 4. Dates and Timestamps

## 5. Nulls

## 6. Complex Types