# Built-in functions

## Download and install Spark

In [5]:
!ls

sample_data


In [6]:
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
!tar xf spark-2.3.1-bin-hadoop2.7.tgz
!pip install -q findspark

Hit:1 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Hit:2 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:5 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Hit:6 http://ppa.launchpad.net/cran/libg

## Setup environment

In [7]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-hadoop2.7"

import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate() 
spark

## Downloading and preprocessing Chicago's Reported Crime Data

In [8]:
!wget https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD
#!ls -l

--2022-06-16 20:06:44--  https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD
Resolving data.cityofchicago.org (data.cityofchicago.org)... 52.206.140.199, 52.206.140.205, 52.206.68.26
Connecting to data.cityofchicago.org (data.cityofchicago.org)|52.206.140.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘rows.csv?accessType=DOWNLOAD’

rows.csv?accessType     [             <=>    ]   1.66G  3.04MB/s    in 9m 24s  

2022-06-16 20:16:09 (3.02 MB/s) - ‘rows.csv?accessType=DOWNLOAD’ saved [1784668933]



In [10]:
!mv rows.csv\?accessType\=DOWNLOAD reported-crimes.csv
!ls -l

mv: cannot stat 'rows.csv?accessType=DOWNLOAD': No such file or directory
total 1963448
-rw-r--r--  1 root root 1784668933 Jun 16 10:53 reported-crimes.csv
drwxr-xr-x  1 root root       4096 Jun 15 13:42 sample_data
drwxrwxr-x 13 1000 1000       4096 Jun  1  2018 spark-2.3.1-bin-hadoop2.7
-rw-r--r--  1 root root  225883783 Jun  1  2018 spark-2.3.1-bin-hadoop2.7.tgz


In [11]:
from pyspark.sql.functions import to_timestamp, col, lit
rc = (spark.read.csv('reported-crimes.csv', header=True)
      .withColumn('Date', to_timestamp(col('Date'), 'MM/dd/yyyy hh:mm:ss a'))
      .filter(col('Date') < lit('2018-11-12')))
rc.show(5)

+--------+-----------+-------------------+--------------------+----+------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+
|      ID|Case Number|               Date|               Block|IUCR|Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|    Latitude|    Longitude|            Location|
+--------+-----------+-------------------+--------------------+----+------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+
|10224738|   HY411648|2015-09-05 13:30:00|     043XX S WOOD ST|0486|     BATTERY|DOMESTIC BATTERY ...|           RESIDENCE| false|    true|0924|     00
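
The pattern `MM/dd/yyyy hh:mm:ss a` passed to `to_timestamp` describes a 12-hour clock with an AM/PM marker. As a rough illustration outside Spark, the same layout can be parsed with Python's `datetime.strptime`. The raw string below is a hypothetical value shaped like the CSV's `Date` column, not one copied from the file, and the `strptime` format is only an approximate counterpart of the Java-style pattern.

```python
from datetime import datetime

# Hypothetical raw value in the same layout as the CSV's Date column.
raw_date = "09/05/2015 01:30:00 PM"

# Java-style pattern used with to_timestamp: MM/dd/yyyy hh:mm:ss a
# Approximate strptime counterpart:          %m/%d/%Y %I:%M:%S %p
parsed = datetime.strptime(raw_date, "%m/%d/%Y %I:%M:%S %p")
print(parsed)  # 2015-09-05 13:30:00

# The notebook keeps only rows dated before 2018-11-12.
cutoff = datetime(2018, 11, 12)
keep_row = parsed < cutoff
print(keep_row)  # True
```

Using `hh` with `a` matters here: `HH` would expect a 24-hour clock and would not account for the AM/PM marker.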

## Built-in functions

In [7]:
from pyspark.sql import functions

In [8]:
print(dir(functions))



## String functions

**Display the Primary Type column in lowercase and in uppercase, along with its first 4 characters**

In [9]:
from pyspark.sql.functions import lower, upper, substring

In [15]:
help(lower)

Help on function lower in module pyspark.sql.functions:

lower(col)
    Converts a string column to lower case.
    
    .. versionadded:: 1.5



In [24]:
rc.select(lower(col('Primary Type')), upper(col('Primary Type')),substring(col('Primary Type'), 1,4)).show(5)

+-------------------+-------------------+-----------------------------+
|lower(Primary Type)|upper(Primary Type)|substring(Primary Type, 1, 4)|
+-------------------+-------------------+-----------------------------+
|            battery|            BATTERY|                         BATT|
|              theft|              THEFT|                         THEF|
|              theft|              THEFT|                         THEF|
|          narcotics|          NARCOTICS|                         NARC|
|            assault|            ASSAULT|                         ASSA|
+-------------------+-------------------+-----------------------------+
only showing top 5 rows
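
One detail worth noting is that `substring` counts positions from 1, whereas Python slices count from 0. The sketch below mimics the three column expressions on a single plain Python string. The helper `substring_like_spark` is a hypothetical name introduced for illustration only; it is not part of PySpark and only covers the simple case of a position of 1 or greater.

```python
def substring_like_spark(text, pos, length):
    """Illustrative stand-in for substring(col, pos, len) with pos >= 1."""
    start = pos - 1  # convert the 1-based position to a 0-based index
    return text[start:start + length]


primary_type = "BATTERY"

lowered = primary_type.lower()                          # like lower(col)
uppered = primary_type.upper()                          # like upper(col)
first_four = substring_like_spark(primary_type, 1, 4)   # like substring(col, 1, 4)

print(lowered, uppered, first_four)  # battery BATTERY BATT
```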



## Numeric functions


**Show the oldest date and the most recent date**

In [12]:
# Note: these names shadow Python's built-in min and max for the rest of the notebook
from pyspark.sql.functions import min, max, date_sub

In [38]:
rc.select(max(col('Date'))).show()

+-------------------+
|          max(Date)|
+-------------------+
|2018-11-11 23:50:00|
+-------------------+



In [39]:
rc.select(min(col('Date'))).show()

+-------------------+
|          min(Date)|
+-------------------+
|2001-01-01 00:00:00|
+-------------------+



## Date

**What is 3 days earlier than the oldest date and 3 days later than the most recent date?**

In [14]:
from pyspark.sql.functions import date_add
rc.select(date_sub(min(col('Date')), 3), date_add(max(col('Date')), 3)).show()

+----------------------+----------------------+
|date_sub(min(Date), 3)|date_add(max(Date), 3)|
+----------------------+----------------------+
|            2000-12-29|            2018-11-14|
+----------------------+----------------------+
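
As a sanity check on the date arithmetic, the same question can be answered with plain Python's `datetime` and `timedelta`. This is a minimal sketch that hard-codes the oldest and most recent timestamps reported by the `min` and `max` cells earlier in the notebook; it does not touch Spark.

```python
from datetime import datetime, timedelta

# Values reported by the min(Date) and max(Date) cells.
oldest = datetime(2001, 1, 1, 0, 0, 0)
newest = datetime(2018, 11, 11, 23, 50, 0)

# Shift by 3 days and keep only the date part.
three_days_earlier = (oldest - timedelta(days=3)).date()
three_days_later = (newest + timedelta(days=3)).date()

print(three_days_earlier)  # 2000-12-29
print(three_days_later)    # 2018-11-14
```

Subtracting days answers the "earlier" half of the question and adding days answers the "later" half, which corresponds to `date_sub` and `date_add` in `pyspark.sql.functions`.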

