# Built-in Functions

PySpark ships with a large set of built-in functions, available in the `pyspark.sql.functions` module.

## Imports


In [1]:
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

## SparkSession

In [2]:
spark = (SparkSession
             .builder
             .getOrCreate())

## Display

This CSS tweak stops the browser from wrapping wide output, so DataFrames printed with `show()` can be scrolled horizontally.

In [3]:
from IPython.display import HTML, display
display(HTML("<style>pre {white-space: pre !important; }</style>"))

## Load the data

In [4]:
import os
from pyspark.sql.functions import to_timestamp, col, lit

data_path = 'file:///' + os.getcwd() + '/data'

file_path = data_path + '/reported-crimes.csv'

crimes_df = (spark.read
    .option("header", "true")
    .csv(file_path)
    .withColumn("Date", to_timestamp(col("Date"), "MM/dd/yyyy hh:mm:ss a"))
    .filter(col("Date") <= lit("2018-11-11"))
)

crimes_df.show(5)

+--------+-----------+-------------------+--------------------+----+------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+
|      ID|Case Number|               Date|               Block|IUCR|Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|    Latitude|    Longitude|            Location|
+--------+-----------+-------------------+--------------------+----+------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+
|10224738|   HY411648|2015-09-05 13:30:00|     043XX S WOOD ST|0486|     BATTERY|DOMESTIC BATTERY ...|           RESIDENCE| false|    true|0924|     00
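
The pattern passed to `to_timestamp` combines `hh` (12-hour clock) with `a` (AM/PM marker), which is why an afternoon time shows up as `13:30:00` in the parsed column. As a rough cross-check outside Spark, the standard library's `strptime` can parse the same layout. The raw string below is an assumed example of how the CSV might store the value, not something copied from the file.

```python
from datetime import datetime

# Assumed raw value, shaped like the "MM/dd/yyyy hh:mm:ss a" pattern used above.
raw_date = "09/05/2015 01:30:00 PM"

# %I is the 12-hour clock and %p the AM/PM marker, the strptime
# counterparts of hh and a in the Spark pattern.
parsed = datetime.strptime(raw_date, "%m/%d/%Y %I:%M:%S %p")

print(parsed)
```

If the pattern used `HH` (24-hour clock) instead, a value carrying an AM/PM marker would likely fail to parse and come back as null.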

## Built-in functions

In [5]:
from pyspark.sql import functions

In [6]:
print(dir(functions))
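
The raw `dir()` listing also includes private and dunder names, which makes it hard to scan. A minimal sketch of a filter is shown below; `public_names` is a hypothetical helper, and it is demonstrated on the standard-library `math` module only so that it runs without a Spark installation.

```python
import math

def public_names(module):
    """Return the sorted names in a module that do not start with an underscore."""
    return sorted(name for name in dir(module) if not name.startswith("_"))

# Demonstrated on math so the snippet is self-contained;
# in the notebook, pass the imported `functions` module instead.
names = public_names(math)
print(len(names), names[:5])
```

In the notebook, `public_names(functions)` should give a shorter list to browse, although it will still contain helper names that are not column functions.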



## String functions

**Display the Primary Type column in lowercase and uppercase, along with its first 4 characters**

In [7]:
from pyspark.sql.functions import lower, upper, substring

In [8]:
crimes_df.printSchema()

root
 |-- ID: string (nullable = true)
 |-- Case Number: string (nullable = true)
 |-- Date: timestamp (nullable = true)
 |-- Block: string (nullable = true)
 |-- IUCR: string (nullable = true)
 |-- Primary Type: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Location Description: string (nullable = true)
 |-- Arrest: string (nullable = true)
 |-- Domestic: string (nullable = true)
 |-- Beat: string (nullable = true)
 |-- District: string (nullable = true)
 |-- Ward: string (nullable = true)
 |-- Community Area: string (nullable = true)
 |-- FBI Code: string (nullable = true)
 |-- X Coordinate: string (nullable = true)
 |-- Y Coordinate: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Updated On: string (nullable = true)
 |-- Latitude: string (nullable = true)
 |-- Longitude: string (nullable = true)
 |-- Location: string (nullable = true)



In [9]:
crimes_df.select(
    lower(col('Primary Type')), 
    upper(col('Primary Type')), 
    substring(col('Primary Type'), 1, 4)).show(5)

+-------------------+-------------------+-----------------------------+
|lower(Primary Type)|upper(Primary Type)|substring(Primary Type, 1, 4)|
+-------------------+-------------------+-----------------------------+
|            battery|            BATTERY|                         BATT|
|              theft|              THEFT|                         THEF|
|              theft|              THEFT|                         THEF|
|          narcotics|          NARCOTICS|                         NARC|
|            assault|            ASSAULT|                         ASSA|
+-------------------+-------------------+-----------------------------+
only showing top 5 rows
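
Note that `substring` counts positions from 1, so `substring(col, 1, 4)` returns the first four characters. The plain-Python sketch below mirrors that behavior for one value. `spark_like_substring` is a hypothetical helper written for illustration and only covers positive start positions.

```python
def spark_like_substring(text, pos, length):
    """Mimic substring(text, pos, length) for a 1-based, positive pos.

    Hypothetical helper for illustration; not part of PySpark.
    """
    if pos < 1:
        raise ValueError("this sketch only covers positions starting at 1")
    start = pos - 1  # convert the 1-based position to Python's 0-based index
    return text[start:start + length]

value = "BATTERY"
print(value.lower(), value.upper(), spark_like_substring(value, 1, 4))
```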



## Numeric functions

**Show the oldest date and the most recent date**

In [10]:
# Note: these imports shadow Python's built-in min and max for the rest of the notebook.
from pyspark.sql.functions import min, max

In [11]:
crimes_df.select(min(col('Date')), max(col('Date'))).show(1)



+-------------------+-------------------+
|          min(Date)|          max(Date)|
+-------------------+-------------------+
|2001-01-01 00:00:00|2018-11-11 00:00:00|
+-------------------+-------------------+
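
The maximum lands exactly on midnight, which is consistent with the load step filtering on the date-only string `"2018-11-11"`. Assuming Spark compares that literal as `2018-11-11 00:00:00`, any record later on the same day would fail the `<=` test. The standard-library sketch below illustrates the comparison with hypothetical record times that are not taken from the dataset.

```python
from datetime import datetime

# A date-only cutoff read as a timestamp has no time part, so it means midnight.
cutoff = datetime.fromisoformat("2018-11-11")

# Hypothetical record times, not taken from the dataset.
at_midnight = datetime(2018, 11, 11, 0, 0, 0)
afternoon = datetime(2018, 11, 11, 13, 0, 0)

print(at_midnight <= cutoff)  # this record would pass a <= filter
print(afternoon <= cutoff)    # this record would be filtered out
```

To keep the whole day, one option is to compare against the following day with a strict `<` instead.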



## Date functions

**What is 3 days earlier than the oldest date and 3 days later than the most recent date?**

In [12]:
from pyspark.sql.functions import date_add, date_sub

In [13]:
crimes_df.select(date_sub(min(col('Date')), 3), date_add(max(col('Date')), 3)).show(1)



+----------------------+----------------------+
|date_sub(min(Date), 3)|date_add(max(Date), 3)|
+----------------------+----------------------+
|            2000-12-29|            2018-11-14|
+----------------------+----------------------+
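
The result can be double-checked with the standard library. The sketch below takes the minimum and maximum timestamps reported by the earlier query and applies the same three-day offsets with `timedelta`. It runs outside Spark and is only a sanity check on the arithmetic.

```python
from datetime import datetime, timedelta

# Minimum and maximum timestamps reported by the earlier query.
oldest = datetime(2001, 1, 1, 0, 0, 0)
latest = datetime(2018, 11, 11, 0, 0, 0)

# date_sub and date_add return dates, so the time part is dropped here as well.
three_days_earlier = (oldest - timedelta(days=3)).date()
three_days_later = (latest + timedelta(days=3)).date()

print(three_days_earlier, three_days_later)
```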



