### Exercise 01 - Get monthly crime count by type

<ul>
<li>Details - Duration 40 minutes
<ul>
<li>Data set <a href="https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2">URL</a>
</li>
<li>Use the language of your choice: Python or Scala</li>
<li>Data is available in HDFS under /public/crime/csv</li>
<li>You can check properties of files using <code>hadoop fs -ls -h /public/crime/csv</code>
</li>
<li>Structure of data (ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location)</li>
<li>File format - text file</li>
<li>Delimiter - “,”</li>
<li>Get the monthly count of crimes by primary type, sorted by month in ascending order and by number of crimes per type in descending order</li>
<li>Store the result in HDFS path /user/&lt;YOUR_USER_ID&gt;/solutions/solution01/crimes_by_type_by_month</li>
<li>Output File Format: TEXT</li>
<li>Output Columns: Month in YYYYMM format, crime count, crime type</li>
<li>Output Delimiter: \t (tab delimited)</li>
<li>Output Compression: gzip</li>
</ul>
</li>
<li>Validation</li>
<li>Solutions
<ul>
<li>In Scala using <a href="https://gist.github.com/dgadiraju/0c81b0cdae330274fe260bf446108b38">Core API</a>
</li>
<li>In Scala using <a href="https://gist.github.com/dgadiraju/1b11b67bc69c79d9718316e81372cdc3">Data Frames and SQL</a>
</li>
</ul>
</li>
</ul>
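
To make the target output concrete, here is a minimal plain-Python sketch (no Spark) of the same transformation applied to four rows copied from the data set and trimmed to the first six columns. The helper name `monthly_counts` is an assumption made for illustration; the actual solution still has to run on the cluster.

```python
from collections import Counter

# Four sample rows, trimmed to the first six columns
# (ID, Case Number, Date, Block, IUCR, Primary Type).
rows = [
    "11034701,JA366925,01/01/2001 11:00:00 AM,016XX E 86TH PL,1153,DECEPTIVE PRACTICE",
    "11227287,JB147188,10/08/2017 03:00:00 AM,092XX S RACINE AVE,0281,CRIM SEXUAL ASSAULT",
    "11227583,JB147595,03/28/2017 02:00:00 PM,026XX W 79TH ST,0620,BURGLARY",
    "11227293,JB147230,09/09/2017 08:17:00 PM,060XX S EBERHART AVE,0810,THEFT",
]

def monthly_counts(lines):
    """Return 'YYYYMM<TAB>count<TAB>type' strings in the required order."""
    counts = Counter()
    for line in lines:
        fields = line.split(",")
        # Date is MM/DD/YYYY followed by a time; keep only year and month.
        mm, _dd, yyyy = fields[2].split(" ")[0].split("/")
        counts[(yyyy + mm, fields[5])] += 1
    # Month ascending, then count descending within each month.
    ordered = sorted(counts.items(), key=lambda kv: (kv[0][0], -kv[1]))
    return ["\t".join([month, str(n), crime]) for (month, crime), n in ordered]

for out in monthly_counts(rows):
    print(out)
```

Each printed line has the three required columns (month, count, type) separated by tabs.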

In [None]:
#Core API Logic

In [1]:
#check file size
! hdfs dfs -ls -h /public/csv/*

2020-06-06 17:39:21,194 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   2 pi supergroup      1.6 G 2020-06-06 17:38 /public/csv/crimes.csv


 - Cluster capacity is 24 cores and 9 GB of memory. The initial plan was 12 executors with 512M each; the command below requests 3 executors with 2 cores and 1G each.
***

pyspark --master yarn --num-executors 3 --executor-cores 2 --executor-memory 1G
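
The note mentions 12 executors at 512M, while the command requests 3 executors at 1G. A rough way to compare the two is to estimate the memory each layout asks YARN for. The sketch below assumes Spark's default executor memory overhead on YARN (the larger of 384 MB and 10% of executor memory). It ignores YARN's allocation rounding and the memory used by the driver and application master, so treat the numbers as approximate. The function name `yarn_container_mb` is hypothetical.

```python
def yarn_container_mb(executor_memory_mb, overhead_factor=0.10, min_overhead_mb=384):
    """Approximate memory requested from YARN for one executor, in MB.

    Assumes the default overhead rule: max(384 MB, 10% of executor memory).
    """
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead

CLUSTER_MB = 9 * 1024  # 9 GB cluster capacity, as stated in the note

# Compare the layout from the note with the layout from the command.
for executors, memory_mb in [(12, 512), (3, 1024)]:
    total = executors * yarn_container_mb(memory_mb)
    print(executors, "executors x", memory_mb, "MB ->", total, "MB of", CLUSTER_MB, "MB")
```

Under these assumptions the 12 x 512M layout would ask for more than the 9 GB available, while 3 x 1G fits comfortably. That could be one reason to prefer the smaller layout.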

In [2]:
crimesRDD = sc.textFile("/public/csv/crimes.csv")
crimesRDD.take(10)

['ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location',
 '11034701,JA366925,01/01/2001 11:00:00 AM,016XX E 86TH PL,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,false,false,0412,004,8,45,11,,,2001,08/05/2017 03:50:08 PM,,,',
 '11227287,JB147188,10/08/2017 03:00:00 AM,092XX S RACINE AVE,0281,CRIM SEXUAL ASSAULT,NON-AGGRAVATED,RESIDENCE,false,false,2222,022,21,73,02,,,2017,02/11/2018 03:57:41 PM,,,',
 '11227583,JB147595,03/28/2017 02:00:00 PM,026XX W 79TH ST,0620,BURGLARY,UNLAWFUL ENTRY,OTHER,false,false,0835,008,18,70,05,,,2017,02/11/2018 03:57:41 PM,,,',
 '11227293,JB147230,09/09/2017 08:17:00 PM,060XX S EBERHART AVE,0810,THEFT,OVER $500,RESIDENCE,false,false,0313,003,20,42,06,,,2017,02/11/2018 03:57:41 PM,,,',
 '11227634,JB147599,08/26/2017 10:00:00 AM,001XX W RANDOLPH ST,0281,CRIM SEXUAL ASSAULT,NON-AGGRAVATE

In [27]:
def monthTypeKey(rw):
    # Date is field 2 (MM/DD/YYYY hh:mm:ss AM/PM), Primary Type is field 5
    f = rw.split(',')
    mm, dd, yyyy = f[2].split(' ')[0].split('/')
    return (int(yyyy + mm), f[5])
crimesMap = crimesRDD.filter(lambda r: r.split(',')[0] != 'ID'). \
        map(lambda rw: (monthTypeKey(rw), 1)).reduceByKey(lambda x, y: x + y)
crimesMap.take(10)

[((200106, 'DECEPTIVE PRACTICE'), 1229),
 ((200105, 'THEFT'), 8418),
 ((200103, 'BATTERY'), 7658),
 ((200104, 'BATTERY'), 8325),
 ((200104, 'ASSAULT'), 2746),
 ((200107, 'OFFENSE INVOLVING CHILDREN'), 210),
 ((200106, 'CRIM SEXUAL ASSAULT'), 164),
 ((200109, 'BURGLARY'), 2393),
 ((200109, 'OFFENSE INVOLVING CHILDREN'), 172),
 ((200110, 'INTERFERENCE WITH PUBLIC OFFICER'), 21)]
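
Splitting on `','` works as long as no field before Primary Type contains a comma, which holds for the sample rows shown. If a quoted field with an embedded comma did appear, the column positions would shift. A more defensive parser can be sketched with Python's `csv` and `datetime` modules; the helper name `month_and_type` is an assumption for illustration, not part of the original solution.

```python
import csv
from datetime import datetime

def month_and_type(line):
    """Parse one CSV line into (YYYYMM as int, primary type)."""
    # csv.reader honours quoted fields, so embedded commas do not shift columns.
    fields = next(csv.reader([line]))
    # Date looks like '01/01/2001 11:00:00 AM'.
    when = datetime.strptime(fields[2], "%m/%d/%Y %I:%M:%S %p")
    return (when.year * 100 + when.month, fields[5])

sample = ('11034701,JA366925,01/01/2001 11:00:00 AM,016XX E 86TH PL,1153,'
          'DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,'
          'false,false,0412,004,8,45,11,,,2001,08/05/2017 03:50:08 PM,,,')
print(month_and_type(sample))
```

A function like this could be passed to `map` in place of the chained `split` calls. The header row would still need to be filtered out first, because its Date column does not parse as a timestamp.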

In [30]:
# Sort key is (month, -count): month ascending, count descending within each month
crimesForSort = crimesMap.map(lambda x: ((x[0][0], -x[1]), str(x[0][0]) + '\t' + str(x[1]) + '\t' + x[0][1])).sortByKey()
crimesForSort.take(10)

[((200101, -7866), '200101\t7866\tTHEFT'),
 ((200101, -6525), '200101\t6525\tBATTERY'),
 ((200101, -4714), '200101\t4714\tNARCOTICS'),
 ((200101, -3966), '200101\t3966\tCRIMINAL DAMAGE'),
 ((200101, -2800), '200101\t2800\tOTHER OFFENSE'),
 ((200101, -2123), '200101\t2123\tASSAULT'),
 ((200101, -2095), '200101\t2095\tMOTOR VEHICLE THEFT'),
 ((200101, -1934), '200101\t1934\tBURGLARY'),
 ((200101, -1396), '200101\t1396\tROBBERY'),
 ((200101, -1394), '200101\t1394\tDECEPTIVE PRACTICE')]
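
The negated count in the sort key is what produces a descending order inside an ascending sort. The plain-Python sketch below shows the same idea with `sorted`, using a few counts copied from the `reduceByKey` output above; it illustrates the key and is not Spark code.

```python
# (month, crime type) -> count, copied from the reduceByKey output
counts = {
    (200104, 'BATTERY'): 8325,
    (200103, 'BATTERY'): 7658,
    (200104, 'ASSAULT'): 2746,
    (200105, 'THEFT'): 8418,
}

# Tuples compare element by element, so (month, -count) orders by
# month ascending and, within a month, by count descending.
ordered = sorted(counts.items(), key=lambda kv: (kv[0][0], -kv[1]))
for (month, crime), n in ordered:
    print(month, n, crime, sep='\t')
```

Negating works here because the counts are numeric; a string column would need a different approach to sort in descending order.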

In [35]:
crimesForSort.map(lambda x:x[1]).saveAsTextFile("/user/pi/solutions/solution01/crimes_by_type_by_month", \
                                            compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

In [38]:
for i in sc.textFile("/user/pi/solutions/solution01/crimes_by_type_by_month").take(10):
    print(i)

200101	7866	THEFT
200101	6525	BATTERY
200101	4714	NARCOTICS
200101	3966	CRIMINAL DAMAGE
200101	2800	OTHER OFFENSE
200101	2123	ASSAULT
200101	2095	MOTOR VEHICLE THEFT
200101	1934	BURGLARY
200101	1396	ROBBERY
200101	1394	DECEPTIVE PRACTICE
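
The exercise lists a validation step without giving details. One possible check, sketched here in plain Python, is that every line has the three tab-delimited columns and that the lines are ordered by month ascending and count descending. The function name `check_output` and the exact rules are assumptions, not the official validation.

```python
import re

# YYYYMM, a tab, the count, a tab, then the crime type.
LINE_RE = re.compile(r"^(\d{6})\t(\d+)\t(.+)$")

def check_output(lines):
    """Return a list of problems found in tab-delimited output lines."""
    problems = []
    previous = None
    for number, line in enumerate(lines, start=1):
        match = LINE_RE.match(line)
        if match is None:
            problems.append("line %d: bad format" % number)
            continue
        month, count = match.group(1), int(match.group(2))
        if not 1 <= int(month[4:]) <= 12:
            problems.append("line %d: bad month" % number)
        # Same ordering rule as the sort key: month ascending, count descending.
        key = (month, -count)
        if previous is not None and key < previous:
            problems.append("line %d: out of order" % number)
        previous = key
    return problems

sample_lines = [
    "200101\t7866\tTHEFT",
    "200101\t6525\tBATTERY",
    "200101\t4714\tNARCOTICS",
]
print(check_output(sample_lines))
```

An empty list means no problems were found. The lines returned by `take(10)` could be passed in directly, or a larger sample for a more thorough check.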


In [40]:
! hdfs dfs -ls /user/pi/solutions/solution01/crimes_by_type_by_month/*

2020-06-06 19:33:07,172 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   2 pi supergroup          0 2020-06-06 19:31 /user/pi/solutions/solution01/crimes_by_type_by_month/_SUCCESS
-rw-r--r--   2 pi supergroup       4032 2020-06-06 19:31 /user/pi/solutions/solution01/crimes_by_type_by_month/part-00000.gz
-rw-r--r--   2 pi supergroup       2276 2020-06-06 19:31 /user/pi/solutions/solution01/crimes_by_type_by_month/part-00001.gz
-rw-r--r--   2 pi supergroup       1938 2020-06-06 19:31 /user/pi/solutions/solution01/crimes_by_type_by_month/part-00002.gz
-rw-r--r--   2 pi supergroup       2427 2020-06-06 19:31 /user/pi/solutions/solution01/crimes_by_type_by_month/part-00003.gz
-rw-r--r--   2 pi supergroup       2322 2020-06-06 19:31 /user/pi/solutions/solution01/crimes_by_type_by_month/part-00004.gz
-rw-r--r--   2 pi supergroup       2824 2020-06-06 19:31 /user/pi/solutions/solution01/crimes_by_type

In [41]:
# Remove the previous output so it can be rewritten with fewer part files
! hdfs dfs -rm -r -f /user/pi/solutions/solution01/crimes_by_type_by_month

2020-06-06 19:33:31,088 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Deleted /user/pi/solutions/solution01/crimes_by_type_by_month


In [42]:
crimesForSort.map(lambda x:x[1]).coalesce(2).saveAsTextFile("/user/pi/solutions/solution01/crimes_by_type_by_month", \
                                            compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

In [43]:
for i in sc.textFile("/user/pi/solutions/solution01/crimes_by_type_by_month").take(10):
    print(i)

200101	7866	THEFT
200101	6525	BATTERY
200101	4714	NARCOTICS
200101	3966	CRIMINAL DAMAGE
200101	2800	OTHER OFFENSE
200101	2123	ASSAULT
200101	2095	MOTOR VEHICLE THEFT
200101	1934	BURGLARY
200101	1396	ROBBERY
200101	1394	DECEPTIVE PRACTICE


In [44]:
! hdfs dfs -ls /user/pi/solutions/solution01/crimes_by_type_by_month/*

2020-06-06 19:34:15,834 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   2 pi supergroup          0 2020-06-06 19:33 /user/pi/solutions/solution01/crimes_by_type_by_month/_SUCCESS
-rw-r--r--   2 pi supergroup      16232 2020-06-06 19:33 /user/pi/solutions/solution01/crimes_by_type_by_month/part-00000.gz
-rw-r--r--   2 pi supergroup      16244 2020-06-06 19:33 /user/pi/solutions/solution01/crimes_by_type_by_month/part-00001.gz
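
To inspect a part file outside Spark, for example after copying it to the local file system with `hdfs dfs -get`, Python's `gzip` module can read it as text. This is a minimal sketch: the helper name `read_gzip_lines` is an assumption, and the demo writes its own small temporary file as a stand-in for a real part file.

```python
import gzip
import os
import tempfile

def read_gzip_lines(path, limit=10):
    """Return up to `limit` text lines from a gzip-compressed file."""
    lines = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            lines.append(line.rstrip("\n"))
            if len(lines) >= limit:
                break
    return lines

# Self-contained demo: write a small gzip file, then read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "part-00000.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("200101\t7866\tTHEFT\n200101\t6525\tBATTERY\n")

print(read_gzip_lines(demo_path))
```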
