# Chicago Crime Data Investigation using PySpark

## Install Spark

In [1]:
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
!tar xf spark-2.3.1-bin-hadoop2.7.tgz
!pip install -q findspark


In [2]:
# Set the paths to the Java and Spark installations
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-hadoop2.7"

In [None]:
# check the list of files and folders in the current directory
!ls

international-airline-passengers.csv  spark-2.3.1-bin-hadoop2.7.tgz
reported-crimes.csv		      spark-2.3.1-bin-hadoop2.7.tgz.1
sample_data			      spark-warehouse
spark-2.3.1-bin-hadoop2.7


In [3]:
# import findspark, initialise it, and create a SparkContext
import findspark
findspark.init()
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
sc

In [4]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate() 
spark

## Downloading and preprocessing Chicago's Reported Crime Data

In [5]:
!wget -q "https://data.cityofchicago.org/api/views/qzdf-xmn8/rows.csv?accessType=DOWNLOAD"

In [6]:
# check all the files in the directory
!ls

'rows.csv?accessType=DOWNLOAD'	 spark-2.3.1-bin-hadoop2.7
 sample_data			 spark-2.3.1-bin-hadoop2.7.tgz


In [8]:
# rename the file to something simple
!mv rows.csv\?accessType\=DOWNLOAD reported-crimes.csv

In [9]:
# check all the files in the directory
!ls

reported-crimes.csv  spark-2.3.1-bin-hadoop2.7
sample_data	     spark-2.3.1-bin-hadoop2.7.tgz


In [11]:
# loading data as a dataframe
from pyspark.sql.functions import to_timestamp,col,lit
rc = (
    spark.read.csv('reported-crimes.csv', header=True)
    .withColumn('Date', to_timestamp(col('Date'), 'MM/dd/yyyy hh:mm:ss a'))
    .filter(col('Date') <= lit('2020-09-01'))
)

In [13]:
# let's check the top five rows
rc.show(5)

+--------+-----------+-------------------+-------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+
|      ID|Case Number|               Date|              Block|IUCR|        Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|    Latitude|    Longitude|            Location|
+--------+-----------+-------------------+-------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+
|12175466|   JD378561|2020-07-01 00:00:00| 050XX N SAWYER AVE|1156|  DECEPTIVE PRACTICE|ATTEMPT - FINANCI...|           APARTMENT|

## Working with columns

**Display only the first 5 rows of the column named IUCR**

+----+
|IUCR|
+----+
|0486|
|0496|
|0486|
|1310|
|0486|
+----+
only showing top 5 rows



**Display only the first 5 rows of the columns Case Number, Date and Arrest**

+-----------+-------------------+------+
|Case Number|               Date|Arrest|
+-----------+-------------------+------+
|   JC207122|2019-03-31 23:51:00| false|
|   JC207126|2019-03-31 23:50:00|  true|
|   JC207120|2019-03-31 23:47:00| false|
|   JC207203|2019-03-31 23:45:00| false|
|   JC207116|2019-03-31 23:40:00| false|
+-----------+-------------------+------+
only showing top 5 rows



**Add a column named One, with every entry set to 1**

In [None]:
from pyspark.sql.functions import lit
rc.withColumn('One',lit(1)).show(5)

+--------+-----------+-------------------+--------------------+----+---------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+--------------------------+---------+---------------+-------------+-----+----------------------+----------------+------------+---+
|      ID|Case Number|               Date|               Block|IUCR|   Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|    Latitude|    Longitude|            Location|Historical Wards 2003-2015|Zip Codes|Community Areas|Census Tracts|Wards|Boundaries - ZIP Codes|Police Districts|Police Beats|One|
+--------+-----------+-------------------+--------------------+----+---------------+--------------------+--------------------+------+--------+----+--------+----+-------

**Remove the column IUCR**

+--------+-----------+-------------------+--------------------+---------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+--------------------------+---------+---------------+-------------+-----+----------------------+----------------+------------+
|      ID|Case Number|               Date|               Block|   Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|    Latitude|    Longitude|            Location|Historical Wards 2003-2015|Zip Codes|Community Areas|Census Tracts|Wards|Boundaries - ZIP Codes|Police Districts|Police Beats|
+--------+-----------+-------------------+--------------------+---------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------

In [None]:
rc.count()

6841687

## Working with rows

**Add the reported crimes for an additional day, 01-April-2019, to our dataset.**

8

In [None]:
rc.union(one_day).show(5)

+--------+-----------+-------------------+--------------------+----+---------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+--------------------------+---------+---------------+-------------+-----+----------------------+----------------+------------+
|      ID|Case Number|               Date|               Block|IUCR|   Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|    Latitude|    Longitude|            Location|Historical Wards 2003-2015|Zip Codes|Community Areas|Census Tracts|Wards|Boundaries - ZIP Codes|Police Districts|Police Beats|
+--------+-----------+-------------------+--------------------+----+---------------+--------------------+--------------------+------+--------+----+--------+----+--------------+

6841695

6841687

+--------+-----------+-------------------+--------------------+----+--------------------+--------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+--------------------------+---------+---------------+-------------+-----+----------------------+----------------+------------+
|      ID|Case Number|               Date|               Block|IUCR|        Primary Type|   Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|    Latitude|    Longitude|            Location|Historical Wards 2003-2015|Zip Codes|Community Areas|Census Tracts|Wards|Boundaries - ZIP Codes|Police Districts|Police Beats|
+--------+-----------+-------------------+--------------------+----+--------------------+--------------+--------------------+------+--------+----+--------+----+--------------+---

**What are the top 10 primary types of reported crime, in descending order of occurrence?**

+--------------------+-------+
|        Primary type|  count|
+--------------------+-------+
|OFFENSE INVOLVING...|  46274|
|            STALKING|   3468|
|PUBLIC PEACE VIOL...|  48315|
|           OBSCENITY|    604|
|NON-CRIMINAL (SUB...|      9|
|               ARSON|  11283|
|   DOMESTIC VIOLENCE|      1|
|            GAMBLING|  14438|
|   CRIMINAL TRESPASS| 195803|
|             ASSAULT| 425573|
|      NON - CRIMINAL|     38|
|LIQUOR LAW VIOLATION|  14130|
| MOTOR VEHICLE THEFT| 317694|
|               THEFT|1440495|
|             BATTERY|1249262|
|             ROBBERY| 258609|
|            HOMICIDE|   9593|
|           RITUALISM|     23|
|    PUBLIC INDECENCY|    164|
| CRIM SEXUAL ASSAULT|  27851|
+--------------------+-------+
only showing top 20 rows



+--------------------+-------+
|        Primary type|  count|
+--------------------+-------+
|               THEFT|1440495|
|             BATTERY|1249262|
|     CRIMINAL DAMAGE| 780494|
|           NARCOTICS| 716461|
|             ASSAULT| 425573|
|       OTHER OFFENSE| 425091|
|            BURGLARY| 391699|
| MOTOR VEHICLE THEFT| 317694|
|  DECEPTIVE PRACTICE| 270421|
|             ROBBERY| 258609|
|   CRIMINAL TRESPASS| 195803|
|   WEAPONS VIOLATION|  72699|
|        PROSTITUTION|  68564|
|PUBLIC PEACE VIOL...|  48315|
|OFFENSE INVOLVING...|  46274|
| CRIM SEXUAL ASSAULT|  27851|
|         SEX OFFENSE|  25612|
|INTERFERENCE WITH...|  15601|
|            GAMBLING|  14438|
|LIQUOR LAW VIOLATION|  14130|
+--------------------+-------+
only showing top 20 rows



## Challenge questions

**What percentage of reported crimes resulted in an arrest?**

0.27657535341795086

**What are the top 3 locations for reported crimes?**

+--------------------+-------+
|Location Description|  count|
+--------------------+-------+
|              STREET|1790520|
|           RESIDENCE|1159139|
|           APARTMENT| 710861|
|            SIDEWALK| 671590|
|               OTHER| 260468|
+--------------------+-------+
only showing top 5 rows



## Built-in functions

In [None]:
from pyspark.sql import functions

In [None]:
print(dir(functions))



## String functions

**Display the Primary Type column in lower and upper characters, and the first 4 characters of the column**

In [None]:
from pyspark.sql.functions import lower,upper,substring

In [None]:
# pass the function object itself; help('lower') looks up the string and finds no documentation
help(lower)

