# Raw data exploration (Scala)

This notebook does some data exploration in Scala. It assumes you have already run the ETL process (see the README or the ETL notebook).

In [1]:
// just like in the PySpark notebook, we get a SparkSession (named `spark`) for free
spark

org.apache.spark.sql.SparkSession@24d30bd6

In [2]:
// really useful imports
val s = spark
import s.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

s = org.apache.spark.sql.SparkSession@24d30bd6


In [3]:
// reading a DataFrame looks essentially the same as in PySpark
val df = s.read.parquet("parquet_data/rocdev/events.parquet")

df = [attachments: array<struct<actions:array<struct<confirm:struct<dismiss_text:string,ok_text:string,text:string,title:string>,id:string,name:string,style:string,text:string,type:string,value:string>>,audio_html:string,audio_html_height:bigint,audio_html_width:bigint,author_icon:string,author_id:string,author_link:string,author_name:string,author_subname:string,callback_id:string,channel_id:string,channel_name:string,color:string,fallback:string,fields:array<struct<short:boolean,title:string,value:string>>,files:array<struct<created:bigint,deanimate_gif:string,display_as_bot:boolean,editable:boolean,external_type:string,filetype:string,id:string,image_exif_rotation:bigint,is_external:boolean,is_public:boolean,is_starred:boolean,mimetype:string,mode:stri...
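
The one-line schema echo above is cut off by the kernel. As a rough sketch, the schema can be inspected more readably with `printSchema()` and `schema.fieldNames`. The example uses a made-up DataFrame (`schemaToy` and its columns are illustrative, not the real events schema) and assumes Spark is on the classpath; on the real data the same two calls would be made on `df`.

```scala
import org.apache.spark.sql.SparkSession

// Inside the notebook this should hand back the existing session;
// run standalone, it starts a local one.
val session = SparkSession.builder.master("local[*]").getOrCreate()
import session.implicits._

// Hypothetical miniature of the events data: two flat columns and one
// nested column (an array of structs), built from plain tuples.
val schemaToy = Seq(
  ("general", "hello", Seq(("wave", 2L))),
  ("careers", "hi", Seq.empty[(String, Long)])
).toDF("channel", "text", "reactions")

// Prints the schema as an indented tree, one field per line.
schemaToy.printSchema()

// Top-level column names only, handy when the full tree is very long.
val topLevel = schemaToy.schema.fieldNames.toSeq
println(topLevel.mkString(", "))
```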


In [4]:
df.count()

168340

In [5]:
// look for events with a missing timestamp; an empty table means there are none
df.filter(col("ts").isNull).show

+-----------+------+--------+-------+-------------+-------+----+--------------+------+----+-----+------+-----+-------+--------+----+---------+------------+----+-------------+--------+--------------+-------+---------+-------+-----------+-----------+-----------------+----+-------+----+---------+-----+---+----+------+---------------+----+--------+
|attachments|bot_id|bot_link|channel|client_msg_id|comment|date|display_as_bot|edited|file|files|hidden|icons|inviter|is_intro|item|item_type|latest_reply|name|new_broadcast|old_name|parent_user_id|purpose|reactions|replies|reply_count|reply_users|reply_users_count|root|subtype|text|thread_ts|topic| ts|type|upload|upload_reply_to|user|username|
+-----------+------+--------+-------+-------------+-------+----+--------------+------+----+-----+------+-----+-------+--------+----+---------+------------+----+-------------+--------+--------------+-------+---------+-------+-----------+-----------+-----------------+----+-------+----+---------+-----+---+----+------+---------------+----+--------+
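
The check above covers only `ts`. A minimal sketch of how the same question could be asked of every column at once, using `sum(when(...))` per column. The `nullToy` DataFrame and its rows are made up for illustration; on the real data the same `select` would be applied to `df`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, when}

// Inside the notebook this should hand back the existing session;
// run standalone, it starts a local one.
val session = SparkSession.builder.master("local[*]").getOrCreate()
import session.implicits._

// Hypothetical stand-in for the events DataFrame, with a few nulls.
val nullToy = Seq(
  ("1.0", "alice", "hello"),
  ("2.0", null, "hi"),
  ("3.0", "bob", null)
).toDF("ts", "user", "text")

// One output row: for each column, the number of rows where it is null.
val nullCounts = nullToy.select(
  nullToy.columns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).as(c)): _*
)

nullCounts.show()
```

A column whose count is 0 here would produce an empty table in a filter like the one above.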

In [6]:
// group by channel and count, largest first; again, this is almost identical to the PySpark version
df.groupBy("channel").count.select("channel", "count").orderBy(col("count").desc).show

+-----------------+-----+
|          channel|count|
+-----------------+-----+
|          general|71493|
|          careers|16418|
|        mentoring| 9293|
|         politics| 6754|
|          paychex| 5639|
|       javascript| 5158|
|         security| 4267|
|           gaming| 3910|
|   remote-workers| 2839|
|american-football| 2777|
|           devops| 2602|
|           python| 2440|
|              git| 2397|
|           random| 2362|
|             food| 2138|
|              www| 1897|
|           status| 1632|
|fakeinternetmoney| 1542|
|   ethics-in-tech| 1496|
|           dotnet| 1424|
+-----------------+-----+
only showing top 20 rows



In [7]:
// let's do a word count by dropping down to the RDD API, which is much easier to use from Scala
df
    .select("text")
    .filter(col("text").isNotNull)
    .rdd
    .flatMap(line => line.getAs[String](0).split(" "))
    .filter(w => w.length > 0)
    .map(w => (w, 1))
    .reduceByKey(_ + _)
    .toDF("word", "count")
    .orderBy(col("count").desc)
    .show

+----+-----+
|word|count|
+----+-----+
| the|75926|
|  to|63417|
|   a|57878|
|   I|54112|
| and|36504|
|  of|33911|
|that|30936|
|  is|28890|
|  in|26199|
|  it|25950|
| for|25096|
| you|20649|
|  on|16417|
|have|16274|
| but|15996|
|  my|14653|
|with|14608|
|  be|14036|
| was|12969|
| not|12478|
+----+-----+
only showing top 20 rows
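
The RDD pipeline above splits on a single space and is case-sensitive, so differently capitalised forms of a word are counted separately. As a hedged alternative, here is a sketch of the same count written against the DataFrame API, lower-casing the text and splitting on runs of whitespace. It is not the method used above, just a variation; `wordToy` and its rows are made up, and on the real data the pipeline would start from `df.select("text")`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, length, lower, split}

// Inside the notebook this should hand back the existing session;
// run standalone, it starts a local one.
val session = SparkSession.builder.master("local[*]").getOrCreate()
import session.implicits._

// Hypothetical stand-in for the text column; the null row mimics
// events that carry no text.
val wordToy = Seq(
  (1L, "the cat and the hat"),
  (2L, "The  cat"),
  (3L, null)
).toDF("id", "text")

val wordCounts = wordToy
  .filter(col("text").isNotNull)
  // lower-case, split on whitespace runs, then emit one row per word
  .select(explode(split(lower(col("text")), "\\s+")).as("word"))
  // drop empty tokens, which can appear when text starts with whitespace
  .filter(length(col("word")) > 0)
  .groupBy("word")
  .count()
  .orderBy(col("count").desc, col("word"))

wordCounts.show()
```

Staying in the DataFrame API keeps the work inside Spark SQL's optimizer, at the cost of the more explicit `flatMap`/`reduceByKey` style shown above.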

