
# Built-in functions

## Download and install Spark

## Setup environment

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp,col,lit

In [3]:
# Create SparkSession
spark = SparkSession.builder\
             .appName("spark-app-version-x")\
             .getOrCreate()

## Downloading and preprocessing Chicago's Reported Crime Data

In [5]:
from pyspark.sql.functions import to_timestamp,col,lit
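local_file = '../datasets/csv/'
# Read the CSV files, parse the Date strings into timestamps,
# and keep only the records dated before 2024-08-01.
rc = (
    spark.read.csv(local_file, header=True)
    .withColumn('Date', to_timestamp(col('Date'), 'MM/dd/yyyy hh:mm:ss a'))
    .filter(col('Date') < lit('2024-08-01'))
)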

rc.show(5)

+--------+-----------+-------------------+--------------------+----+-------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+
|      ID|Case Number|               Date|               Block|IUCR|       Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|    Latitude|    Longitude|            Location|
+--------+-----------+-------------------+--------------------+----+-------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+
|13551060|   JH372750|2024-07-31 23:59:00|011XX S FRANCISCO...|0910|MOTOR VEHICLE THEFT|          AUTOMOBILE|              STREET|
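
The pattern `'MM/dd/yyyy hh:mm:ss a'` passed to `to_timestamp` describes a month/day/year date followed by a 12-hour clock time with an AM/PM marker. As a rough plain-Python analogue (not PySpark itself), the sketch below parses a string of that shape with `datetime.strptime` and applies the same cutoff comparison. The raw string is a hypothetical value, assumed to match the first parsed `Date` shown above, because the unparsed CSV text does not appear in this notebook.

```python
from datetime import datetime

# Spark pattern 'MM/dd/yyyy hh:mm:ss a' mapped to Python strptime directives:
# MM -> %m, dd -> %d, yyyy -> %Y, hh (12-hour clock) -> %I, mm -> %M, ss -> %S, a -> %p
PY_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# Hypothetical raw value (assumption): chosen to match the first parsed Date
raw_value = "07/31/2024 11:59:00 PM"

parsed = datetime.strptime(raw_value, PY_FORMAT)
cutoff = datetime(2024, 8, 1)  # same cutoff as the filter on the Date column

print(parsed)           # the 12-hour "11:59:00 PM" becomes 23:59:00
print(parsed < cutoff)  # rows before the cutoff are kept
```

Using `hh` with `a` matters here: `HH` would expect a 24-hour clock, so a string carrying an AM/PM marker would not parse as intended.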

## Built-in functions

In [6]:
from pyspark.sql import functions

In [7]:
print(dir(functions))



## String functions

**Display the Primary Type column in lowercase and uppercase, along with its first 4 characters**

In [8]:
from pyspark.sql.functions import lower, upper, substring

In [9]:
help(lower)

Help on function lower in module pyspark.sql.functions:

lower(col: 'ColumnOrName') -> pyspark.sql.column.Column
    Converts a string expression to lower case.
    
    .. versionadded:: 1.5.0
    
    .. versionchanged:: 3.4.0
        Supports Spark Connect.
    
    Parameters
    ----------
    col : :class:`~pyspark.sql.Column` or str
        target column to work on.
    
    Returns
    -------
    :class:`~pyspark.sql.Column`
        lower case values.
    
    Examples
    --------
    >>> df = spark.createDataFrame(["Spark", "PySpark", "Pandas API"], "STRING")
    >>> df.select(lower("value")).show()
    +------------+
    |lower(value)|
    +------------+
    |       spark|
    |     pyspark|
    |  pandas api|
    +------------+



In [10]:
rc.select(lower(col('Primary Type')), upper(col('Primary Type')),substring(col('Primary Type'), 1,4)).show(5)

+-------------------+-------------------+-----------------------------+
|lower(Primary Type)|upper(Primary Type)|substring(Primary Type, 1, 4)|
+-------------------+-------------------+-----------------------------+
|motor vehicle theft|MOTOR VEHICLE THEFT|                         MOTO|
|              theft|              THEFT|                         THEF|
|      other offense|      OTHER OFFENSE|                         OTHE|
|    criminal damage|    CRIMINAL DAMAGE|                         CRIM|
|            battery|            BATTERY|                         BATT|
+-------------------+-------------------+-----------------------------+
only showing top 5 rows
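
One detail that is easy to miss is that Spark's `substring(col, pos, len)` counts positions from 1, whereas Python slices count from 0. The plain-Python sketch below reproduces the five rows above using a hypothetical helper, `substring_1_based`, which is not part of PySpark and only handles `pos >= 1`.

```python
# The five Primary Type values shown in the output above
primary_types = [
    "MOTOR VEHICLE THEFT",
    "THEFT",
    "OTHER OFFENSE",
    "CRIMINAL DAMAGE",
    "BATTERY",
]

def substring_1_based(text, pos, length):
    """Hypothetical helper mimicking substring() for pos >= 1 only."""
    start = pos - 1  # convert the 1-based position to a 0-based index
    return text[start:start + length]

rows = [(t.lower(), t.upper(), substring_1_based(t, 1, 4)) for t in primary_types]

for row in rows:
    print(row)
```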



## Numeric functions


**Show the oldest date and the most recent date**

In [11]:
from pyspark.sql.functions import min, max, date_sub

In [12]:
rc.select(max(col('Date'))).show()

+-------------------+
|          max(Date)|
+-------------------+
|2024-07-31 23:59:00|
+-------------------+



In [13]:
rc.select(min(col('Date'))).show()

+-------------------+
|          min(Date)|
+-------------------+
|2024-01-01 00:00:00|
+-------------------+
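
Importing `min` and `max` by name from `pyspark.sql.functions` rebinds those names in the notebook, so Python's built-in `min` and `max` are hidden for the rest of the session. The sketch below reproduces the effect in plain Python with a hypothetical stand-in function (it is not the PySpark function) and shows that the built-ins stay reachable through the `builtins` module.

```python
import builtins

def max(col_name):
    """Hypothetical stand-in for a column function; NOT pyspark.sql.functions.max."""
    return f"max({col_name})"

# The name `max` now refers to the stand-in defined above
shadowed = max("Date")

# The original built-in is still available via the builtins module
builtin_result = builtins.max([3, 7, 5])

print(shadowed)
print(builtin_result)
```

A common way to avoid the clash is to import the module under an alias, for example `from pyspark.sql import functions as F`, and call `F.min` and `F.max`.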



## Date functions

**What is 3 days earlier than the oldest date and 3 days later than the most recent date?**

In [14]:
# date_sub moves the oldest date back 3 days; date_add moves the most recent date forward 3 days
rc.select(date_sub(min(col('Date')), 3), functions.date_add(max(col('Date')), 3)).show()

+----------------------+----------------------+
|date_sub(min(Date), 3)|date_add(max(Date), 3)|
+----------------------+----------------------+
|            2023-12-29|            2024-08-03|
+----------------------+----------------------+
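
In PySpark, `date_sub` moves a date backwards and `date_add` moves it forwards. As a cross-check on the question, the plain-Python sketch below does the same arithmetic with `timedelta`, starting from the oldest and most recent timestamps reported by the `min(Date)` and `max(Date)` outputs earlier. The variable names are illustrative only.

```python
from datetime import datetime, timedelta

# Values taken from the min(Date) and max(Date) outputs earlier in the notebook
oldest = datetime(2024, 1, 1, 0, 0, 0)
most_recent = datetime(2024, 7, 31, 23, 59, 0)

three_days = timedelta(days=3)

# Keep only the date part, as date_sub/date_add return dates rather than timestamps
earlier = (oldest - three_days).date()
later = (most_recent + three_days).date()

print(earlier.isoformat())  # 3 days before the oldest date
print(later.isoformat())    # 3 days after the most recent date
```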

