<li>Details - Duration 15 to 20 minutes
<ul>
<li>Data is available in HDFS under /public/crime/csv (in this notebook's environment the file is /public/csv/crimes.csv)</li>
<li>Structure of data (ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location)<br>
File format - text file</li>
<li>Delimiter - “,” (use a regex while splitting, <code>split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1)</code>, as some fields contain commas and are enclosed in double quotes)</li>
<li>Get the top 3 crime types by number of incidents where “Location Description” is RESIDENCE</li>
<li>Store the result in HDFS path /user/&lt;YOUR_USER_ID&gt;/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA</li>
<li>Output Fields: Crime Type, Number of Incidents</li>
<li>Output File Format: JSON</li>
<li>Output Delimiter: N/A</li>
<li>Output Compression: No</li>
</ul>
</li>
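
The quote-aware split from the task description can be tried on a single record in plain Python. This is only a minimal sketch: the sample record and the helper name `split_csv_line` are made up for illustration. The two-argument `split(regex, -1)` form in the task looks like the Java/Scala `String.split`, where `-1` keeps trailing empty fields; Python's `re.split` keeps them by default, so no limit argument is needed here.

```python
import re

# Split on commas that are followed by an even number of double quotes,
# i.e. commas that are not inside a quoted field.
COMMA_OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def split_csv_line(line):
    """Split one record on commas that sit outside double quotes."""
    return COMMA_OUTSIDE_QUOTES.split(line)


# Hypothetical record with commas inside a quoted field
sample = '1,"SCHOOL, PUBLIC, BUILDING",THEFT'
fields = split_csv_line(sample)
print(fields)
```

Note that the quotes stay part of the field value; the pattern only decides where to split.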

In [1]:
! hdfs dfs -ls  /public/*

2020-06-08 19:15:02,410 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   2 pi supergroup 1680866151 2020-06-06 17:38 /public/csv/crimes.csv


In [6]:
crimesRDD = sc.textFile('/public/csv')

In [7]:
crimesRDD.take(10)

['ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location',
 '11034701,JA366925,01/01/2001 11:00:00 AM,016XX E 86TH PL,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,false,false,0412,004,8,45,11,,,2001,08/05/2017 03:50:08 PM,,,',
 '11227287,JB147188,10/08/2017 03:00:00 AM,092XX S RACINE AVE,0281,CRIM SEXUAL ASSAULT,NON-AGGRAVATED,RESIDENCE,false,false,2222,022,21,73,02,,,2017,02/11/2018 03:57:41 PM,,,',
 '11227583,JB147595,03/28/2017 02:00:00 PM,026XX W 79TH ST,0620,BURGLARY,UNLAWFUL ENTRY,OTHER,false,false,0835,008,18,70,05,,,2017,02/11/2018 03:57:41 PM,,,',
 '11227293,JB147230,09/09/2017 08:17:00 PM,060XX S EBERHART AVE,0810,THEFT,OVER $500,RESIDENCE,false,false,0313,003,20,42,06,,,2017,02/11/2018 03:57:41 PM,,,',
 '11227634,JB147599,08/26/2017 10:00:00 AM,001XX W RANDOLPH ST,0281,CRIM SEXUAL ASSAULT,NON-AGGRAVATE

In [36]:
import re
from pyspark.sql import Row
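# Split on commas that are outside double-quoted fields
commaDelim = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')
header = crimesRDD.first()
# Drop the header, split each record once, keep RESIDENCE rows, emit (Primary Type, 1)
crimeMap = crimesRDD.filter(lambda rw: rw != header). \
    map(lambda rw: commaDelim.split(rw)). \
    filter(lambda f: f[7] == 'RESIDENCE'). \
    map(lambda f: (f[5], 1))
crimeGp = crimeMap.reduceByKey(lambda x, y: x + y)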


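The filter / map / reduceByKey steps of the RDD cell can be checked locally on a few records before running them on the full file. The sketch below mirrors that logic with a plain `Counter`; the sample rows are shortened, hypothetical records (only the first nine columns), and `count_residence_crimes` is an illustrative name, not part of the notebook.

```python
import re
from collections import Counter

COMMA_OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def count_residence_crimes(lines):
    """Count incidents per Primary Type (field 5) for rows whose
    Location Description (field 7) is RESIDENCE. The first line is the header."""
    counts = Counter()
    for line in lines[1:]:                          # skip the header row
        fields = COMMA_OUTSIDE_QUOTES.split(line)   # split each record once
        if fields[7] == 'RESIDENCE':
            counts[fields[5]] += 1
    return counts


# Hypothetical, shortened records; two of them have a quoted comma in Description
lines = [
    'ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest',
    '1,A1,01/01/2001,016XX E 86TH PL,1153,DECEPTIVE PRACTICE,"THEFT, OVER $300",RESIDENCE,false',
    '2,A2,03/28/2017,026XX W 79TH ST,0620,BURGLARY,UNLAWFUL ENTRY,OTHER,false',
    '3,A3,09/09/2017,060XX S EBERHART AVE,0810,THEFT,OVER $500,RESIDENCE,false',
    '4,A4,09/10/2017,001XX W RANDOLPH ST,0820,THEFT,"$500 AND UNDER, RETAIL",RESIDENCE,true',
]
counts = count_residence_crimes(lines)
print(dict(counts))
```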


In [48]:
crimeDF = crimeGp.map(lambda rw:Row(crime_type=rw[0],cnt=rw[1])).toDF()
crimeDF.show()

+------+--------------------+
|   cnt|          crime_type|
+------+--------------------+
| 28037|           NARCOTICS|
|  3282| MOTOR VEHICLE THEFT|
|   108|            GAMBLING|
|     1|    PUBLIC INDECENCY|
|273310|             BATTERY|
|   480|INTERFERENCE WITH...|
|     2|OTHER NARCOTIC VI...|
|201666|       OTHER OFFENSE|
|156287|               THEFT|
| 81855|  DECEPTIVE PRACTICE|
|  6752|         SEX OFFENSE|
|  9239| CRIM SEXUAL ASSAULT|
|  8313|   WEAPONS VIOLATION|
|    14|      NON - CRIMINAL|
|153547|     CRIMINAL DAMAGE|
|  1708|          KIDNAPPING|
|  1203|        INTIMIDATION|
|  1010|            STALKING|
|     5|           RITUALISM|
| 25413|   CRIMINAL TRESPASS|
+------+--------------------+
only showing top 20 rows



In [45]:
from pyspark.sql.window import Window
spec = Window.orderBy(crimeDF.cnt.desc())

In [46]:
from pyspark.sql.functions import row_number
crimeRnk = crimeDF. \
    withColumn('rank',row_number().over(spec))
crimeRnk.show()

+------+--------------------+----+
|   cnt|          crime_type|rank|
+------+--------------------+----+
|273310|             BATTERY|   1|
|201666|       OTHER OFFENSE|   2|
|156287|               THEFT|   3|
|153547|     CRIMINAL DAMAGE|   4|
|137194|            BURGLARY|   5|
| 81855|  DECEPTIVE PRACTICE|   6|
| 77653|             ASSAULT|   7|
| 28037|           NARCOTICS|   8|
| 26484|OFFENSE INVOLVING...|   9|
| 25413|   CRIMINAL TRESPASS|  10|
|  9239| CRIM SEXUAL ASSAULT|  11|
|  8313|   WEAPONS VIOLATION|  12|
|  6752|         SEX OFFENSE|  13|
|  5147|             ROBBERY|  14|
|  4924|PUBLIC PEACE VIOL...|  15|
|  3282| MOTOR VEHICLE THEFT|  16|
|  2184|               ARSON|  17|
|  1708|          KIDNAPPING|  18|
|  1203|        INTIMIDATION|  19|
|  1010|            STALKING|  20|
+------+--------------------+----+
only showing top 20 rows



In [47]:
crimeRnk.filter(crimeRnk.rank <=3). \
    select('crime_type','cnt'). \
    orderBy(crimeRnk.cnt.desc()). \
    show(truncate=False)

+-------------+------+
|crime_type   |cnt   |
+-------------+------+
|BATTERY      |273310|
|OTHER OFFENSE|201666|
|THEFT        |156287|
+-------------+------+
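
What `row_number()` over a window ordered by `cnt` descending does can be sketched in plain Python: sort by count, number the rows from 1, keep the rows numbered 3 or lower. The counts below are copied from the ranked output above; the variable names are illustrative only.

```python
# Counts taken from the first seven rows of the ranked output
counts = {
    'BATTERY': 273310,
    'OTHER OFFENSE': 201666,
    'THEFT': 156287,
    'CRIMINAL DAMAGE': 153547,
    'BURGLARY': 137194,
    'DECEPTIVE PRACTICE': 81855,
    'ASSAULT': 77653,
}

# Order by count, descending (the window's orderBy)
ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Number the rows from 1 (row_number) and keep rank <= 3
ranked = [(crime, cnt, pos) for pos, (crime, cnt) in enumerate(ordered, start=1)]
top3 = [(crime, cnt) for crime, cnt, pos in ranked if pos <= 3]
print(top3)
```

With tied counts, row numbering picks an arbitrary order among the ties, so a ranking function that gives ties the same rank may be preferable when ties matter.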



In [51]:
quatedCommaDelim = spark.read.format('csv').load("/user/pi/quatedCommaDelim.txt").toDF('SL','Name','Age')
quatedCommaDelim.show()

+---+-------------+---+
| SL|         Name|Age|
+---+-------------+---+
|  1| ameen,hashir| 32|
|  2|fathima irine| 30|
|  3| haimi ,Amrin| 40|
|  4|     nazar PP| 60|
+---+-------------+---+
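
The table above suggests that a CSV parser treats a double-quoted field as one value even when it contains commas, which is why the DataFrame reader needs no custom regex. The same behaviour can be seen with Python's standard `csv` module. The file content below is an assumption reconstructed from the parsed table; the actual file is not shown in the notebook.

```python
import csv
import io

# Assumed content of the sample file (reconstructed from the table above)
text = (
    '1,"ameen,hashir",32\n'
    '2,fathima irine,30\n'
    '3,"haimi ,Amrin",40\n'
    '4,nazar PP,60\n'
)

# csv.reader uses the double quote as its default quote character
rows = list(csv.reader(io.StringIO(text)))
for row in rows:
    print(row)
```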



### Using DataFrames

In [3]:
crimesDF = spark.read.format("csv").load('/public/csv/crimes.csv')            

In [19]:
from pyspark.sql.functions import count
crimesTypes = crimesDF. \
        where('_c7 == "RESIDENCE"'). \
        select(crimesDF._c5.alias('CrimeType')). \
        groupBy('CrimeType').agg(count('CrimeType').alias('NoOfIncidents'))
crimesTypes.show()

+--------------------+-------------+
|           CrimeType|NoOfIncidents|
+--------------------+-------------+
|OFFENSE INVOLVING...|        26484|
|CRIMINAL SEXUAL A...|          369|
|            STALKING|         1010|
|PUBLIC PEACE VIOL...|         4924|
|           OBSCENITY|          304|
|NON-CRIMINAL (SUB...|            2|
|               ARSON|         2184|
|            GAMBLING|          108|
|   CRIMINAL TRESPASS|        25413|
|             ASSAULT|        77653|
|      NON - CRIMINAL|           14|
|LIQUOR LAW VIOLATION|          275|
| MOTOR VEHICLE THEFT|         3282|
|               THEFT|       156287|
|             BATTERY|       273310|
|             ROBBERY|         5147|
|           RITUALISM|            5|
|            HOMICIDE|            1|
|    PUBLIC INDECENCY|            1|
| CRIM SEXUAL ASSAULT|         9239|
+--------------------+-------------+
only showing top 20 rows



In [20]:
from pyspark.sql.window import Window
spec = Window.orderBy(crimesTypes.NoOfIncidents.desc())

In [22]:
from pyspark.sql.functions import row_number
Top3CrimeTypesLocResidence = crimesTypes. \
                                withColumn('rw',row_number().over(spec)). \
                                where("rw <= 3")
Top3CrimeTypesLocResidence. \
    select('CrimeType','NoOfIncidents'). \
    write. \
    json("/user/pi/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA")

In [23]:
spark.read.json("/user/pi/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA").show()

+-------------+-------------+
|    CrimeType|NoOfIncidents|
+-------------+-------------+
|      BATTERY|       273310|
|OTHER OFFENSE|       201666|
|        THEFT|       156287|
+-------------+-------------+



In [24]:
! hdfs dfs -ls /user/pi/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA

2020-06-09 15:14:17,314 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   2 pi supergroup          0 2020-06-09 15:13 /user/pi/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA/_SUCCESS
-rw-r--r--   2 pi supergroup        145 2020-06-09 15:13 /user/pi/solutions/solution03/RESIDENCE_AREA_CRIMINAL_TYPE_DATA/part-00000-408aacb5-eebb-42c0-89c4-874b44d0f807-c000.json
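
Spark's JSON writer produces one JSON object per line (JSON Lines), not a single JSON array. The sketch below rebuilds what the part file presumably contains from the three result rows, assuming compact separators and the column order shown above; under those assumptions the size comes out at 145 bytes, the same as the part file in the listing.

```python
import json

# The three result rows, in the order shown by spark.read.json(...).show()
rows = [
    {"CrimeType": "BATTERY", "NoOfIncidents": 273310},
    {"CrimeType": "OTHER OFFENSE", "NoOfIncidents": 201666},
    {"CrimeType": "THEFT", "NoOfIncidents": 156287},
]

# One compact JSON object per line, each terminated by a newline
json_lines = "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in rows)
size_in_bytes = len(json_lines.encode("utf-8"))

print(json_lines, end="")
print(size_in_bytes)
```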
