In [60]:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

In [61]:
import sqlContext.implicits._

In [62]:
import org.apache.spark.sql.types._

#### 1. Create "histogram" of counts by home_ownership type

In [63]:
val df = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load("hdfs://sandbox.hortonworks.com:8020/tmp/loan.csv")

In [64]:
val home_own = df.groupBy("home_ownership").count().orderBy($"count".desc)

In [65]:
home_own.show()

+--------------+------+
|home_ownership| count|
+--------------+------+
|      MORTGAGE|443455|
|          RENT|355986|
|           OWN| 87449|
|         OTHER|   181|
|          NONE|    50|
|           ANY|     3|
+--------------+------+



#### 2. a) Since we have a lot of columns, and walking every column of every row in a query can be expensive, we might try a columnar data format like Parquet. With the data in columnar format, a query only reads the columns it actually references.
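To make the row-versus-column idea concrete, here is a toy sketch in plain Scala. It is not how Parquet is implemented (real files add encoding, compression and row groups), and the `Loan` and `LoanColumns` types are made up for illustration. It only shows that a count by `home_ownership` needs a single column when the data is laid out column by column.

```scala
// Toy illustration only; Loan and LoanColumns are hypothetical types, not Spark or Parquet APIs.
final case class Loan(homeOwnership: String, purpose: String, loanAmnt: Int)

val rows = Seq(
  Loan("RENT", "car", 5000),
  Loan("MORTGAGE", "house", 20000),
  Loan("RENT", "medical", 3000)
)

// Row-oriented layout: each record carries all of its fields,
// so reading one field still means walking whole records.
val countsFromRows: Map[String, Int] =
  rows.groupBy(_.homeOwnership).map { case (k, v) => k -> v.size }

// Column-oriented layout: each column is stored on its own.
final case class LoanColumns(
  homeOwnership: Vector[String],
  purpose: Vector[String],
  loanAmnt: Vector[Int]
)

val cols = LoanColumns(
  rows.map(_.homeOwnership).toVector,
  rows.map(_.purpose).toVector,
  rows.map(_.loanAmnt).toVector
)

// The same count now touches only the homeOwnership column.
val countsFromColumns: Map[String, Int] =
  cols.homeOwnership.groupBy(identity).map { case (k, v) => k -> v.size }
```

Both layouts give the same answer; the difference is how much data has to be read to get it.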

#### 2. b) Convert to Parquet and run a couple of queries

In [66]:
df.write.parquet("hdfs://sandbox.hortonworks.com:8020/tmp/loan_columnar.csv") //only need to run once

Name: org.apache.spark.sql.AnalysisException
Message: path hdfs://sandbox.hortonworks.com:8020/tmp/loan_columnar.csv already exists.;
StackTrace:   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:80)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQu
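The `AnalysisException` above appears because the default save mode refuses to write to an existing path. One way to make the cell safe to re-run is to set the save mode explicitly. The sketch below assumes `df` is the DataFrame loaded from loan.csv earlier; the helper name `writeParquetIfAbsent` is made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper: write `data` as Parquet unless something already exists at `path`.
// SaveMode.Ignore skips the write silently when the path exists;
// SaveMode.Overwrite would replace the existing data instead.
def writeParquetIfAbsent(data: DataFrame, path: String): Unit =
  data.write.mode(SaveMode.Ignore).parquet(path)

// In this notebook it could be called as:
//   writeParquetIfAbsent(df, "hdfs://sandbox.hortonworks.com:8020/tmp/loan_columnar.csv")
```

`SaveMode.Ignore` fits the "only need to run once" intent. If the CSV changes and the Parquet copy should be refreshed, `SaveMode.Overwrite` is probably the better choice.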

In [67]:
val pq_df = sqlContext.read.parquet("hdfs://sandbox.hortonworks.com:8020/tmp/loan_columnar.csv")

In [68]:
val home_own_pq = pq_df.groupBy("home_ownership").count().orderBy($"count".desc)

In [69]:
home_own_pq.show()

+--------------+------+
|home_ownership| count|
+--------------+------+
|      MORTGAGE|443455|
|          RENT|355986|
|           OWN| 87449|
|         OTHER|   181|
|          NONE|    50|
|           ANY|     3|
+--------------+------+



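That ran quite a bit faster than the same groupBy count on the CSV-backed `df`. This is expected: the CSV reader has to parse every field of every line, whereas the Parquet reader only needs the home_ownership column.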
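The speed comparison here is by eye. A rough way to put numbers on it is a small timing wrapper like the one sketched below. `timed` is a hypothetical helper, not part of Spark, and a single run is only indicative because caching and JVM warm-up can skew the first measurement.

```scala
// Hypothetical helper: run `block` once and return its result together with elapsed milliseconds.
def timed[T](block: => T): (T, Double) = {
  val start = System.nanoTime()
  val result = block
  val elapsedMs = (System.nanoTime() - start) / 1e6
  (result, elapsedMs)
}

// In the notebook this might be used as follows
// (home_own and home_own_pq come from the cells above):
//   val (_, csvMs) = timed { home_own.collect() }
//   val (_, pqMs)  = timed { home_own_pq.collect() }
//   println(f"csv: $csvMs%.0f ms, parquet: $pqMs%.0f ms")

// Stand-alone check that the helper returns the block's result and a non-negative duration.
val (sum, ms) = timed { (1 to 1000).sum }
println(f"sum=$sum, took $ms%.3f ms")
```

Running each query a few times and comparing the later runs should give a fairer picture than a single measurement.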

Let's try some other operations.

In [42]:
pq_df.groupBy("application_type").pivot("purpose").count().show()

+----------------+----+-----------+------------------+-----------+----------------+-----+--------------+-------+------+-----+----------------+--------------+--------+-------+
|application_type| car|credit_card|debt_consolidation|educational|home_improvement|house|major_purchase|medical|moving|other|renewable_energy|small_business|vacation|wedding|
+----------------+----+-----------+------------------+-----------+----------------+-----+--------------+-------+------+-----+----------------+--------------+--------+-------+
|           JOINT|   1|        115|               334|       null|              26| null|             2|      2|     1|   27|            null|             2|       1|   null|
|      INDIVIDUAL|8858|     206022|            523794|        411|           51786| 3702|         17259|   8531|  5412|42840|             575|         10345|    4735|   2343|
+----------------+----+-----------+------------------+-----------+----------------+-----+--------------+-------+------+-----+----------------+--------------+--------+-------+

In [43]:
df.groupBy("application_type").pivot("purpose").count().show()

+----------------+----+-----------+------------------+-----------+----------------+-----+--------------+-------+------+-----+----------------+--------------+--------+-------+
|application_type| car|credit_card|debt_consolidation|educational|home_improvement|house|major_purchase|medical|moving|other|renewable_energy|small_business|vacation|wedding|
+----------------+----+-----------+------------------+-----------+----------------+-----+--------------+-------+------+-----+----------------+--------------+--------+-------+
|           JOINT|   1|        115|               334|       null|              26| null|             2|      2|     1|   27|            null|             2|       1|   null|
|      INDIVIDUAL|8858|     206022|            523794|        411|           51786| 3702|         17259|   8531|  5412|42840|             575|         10345|    4735|   2343|
+----------------+----+-----------+------------------+-----------+----------------+-----+--------------+-------+------+-----+----------------+--------------+--------+-------+

In [79]:
pq_df.withColumn("loan_amnt_int", pq_df("loan_amnt").cast(IntegerType)).groupBy("loan_status", "zip_code").avg("loan_amnt_int").orderBy($"loan_status".desc, $"zip_code".asc).show()

+------------------+--------+------------------+
|       loan_status|zip_code|avg(loan_amnt_int)|
+------------------+--------+------------------+
|Late (31-120 days)|   010xx|           14355.0|
|Late (31-120 days)|   011xx| 15930.76923076923|
|Late (31-120 days)|   012xx|16416.666666666668|
|Late (31-120 days)|   013xx|           10500.0|
|Late (31-120 days)|   014xx| 18867.30769230769|
|Late (31-120 days)|   015xx| 19110.29411764706|
|Late (31-120 days)|   016xx|12338.888888888889|
|Late (31-120 days)|   017xx| 15294.23076923077|
|Late (31-120 days)|   018xx|14456.060606060606|
|Late (31-120 days)|   019xx| 17669.31818181818|
|Late (31-120 days)|   020xx| 22032.14285714286|
|Late (31-120 days)|   021xx|           15600.0|
|Late (31-120 days)|   023xx|        15057.8125|
|Late (31-120 days)|   024xx|          20743.75|
|Late (31-120 days)|   025xx|           14900.0|
|Late (31-120 days)|   026xx|           17540.0|
|Late (31-120 days)|   027xx| 19380.68181818182|
|Late (31-120 days)|

In [80]:
df.withColumn("loan_amnt_int", df("loan_amnt").cast(IntegerType)).groupBy("loan_status", "zip_code").avg("loan_amnt_int").orderBy($"loan_status".desc, $"zip_code".asc).show()

+------------------+--------+------------------+
|       loan_status|zip_code|avg(loan_amnt_int)|
+------------------+--------+------------------+
|Late (31-120 days)|   010xx|           14355.0|
|Late (31-120 days)|   011xx| 15930.76923076923|
|Late (31-120 days)|   012xx|16416.666666666668|
|Late (31-120 days)|   013xx|           10500.0|
|Late (31-120 days)|   014xx| 18867.30769230769|
|Late (31-120 days)|   015xx| 19110.29411764706|
|Late (31-120 days)|   016xx|12338.888888888889|
|Late (31-120 days)|   017xx| 15294.23076923077|
|Late (31-120 days)|   018xx|14456.060606060606|
|Late (31-120 days)|   019xx| 17669.31818181818|
|Late (31-120 days)|   020xx| 22032.14285714286|
|Late (31-120 days)|   021xx|           15600.0|
|Late (31-120 days)|   023xx|        15057.8125|
|Late (31-120 days)|   024xx|          20743.75|
|Late (31-120 days)|   025xx|           14900.0|
|Late (31-120 days)|   026xx|       

Parquet seems to be a bit faster for these queries as well.

#### 3. Counts by loan_status by home_ownership

In [62]:
val loan_stat_pq = pq_df.groupBy("home_ownership", "loan_status").count().orderBy($"home_ownership".desc, $"count".desc)

In [63]:
loan_stat_pq.show()

+--------------+--------------------+------+
|home_ownership|         loan_status| count|
+--------------+--------------------+------+
|          RENT|             Current|235966|
|          RENT|          Fully Paid| 84547|
|          RENT|         Charged Off| 21293|
|          RENT|  Late (31-120 days)|  5360|
|          RENT|              Issued|  3202|
|          RENT|     In Grace Period|  2761|
|          RENT|   Late (16-30 days)|   996|
|          RENT|Does not meet the...|   903|
|          RENT|             Default|   611|
|          RENT|Does not meet the...|   347|
|           OWN|             Current| 62041|
|           OWN|          Fully Paid| 17945|
|           OWN|         Charged Off|  4021|
|           OWN|  Late (31-120 days)|  1212|
|           OWN|              Issued|  1038|
|           OWN|     In Grace Period|   637|
|           OWN|   Late (16-30 days)|   260|
|           OWN|Does not meet the...|   137|
|           OWN|             Default|   110|
|         

#### 4. Do any loans originate in King County, based on wa_zipcodes.csv?

In [81]:
val wa_zip_df = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load("hdfs://sandbox.hortonworks.com:8020/tmp/wa_zipcodes.csv")

Let's collect all the zip codes into an array for checking. Since the loan data only keeps the first three digits (e.g. 980xx), we drop the last two digits of each zip code, append "xx", and deduplicate.

In [100]:
val wa_zips = wa_zip_df.select("Zipcode").rdd.map(r => r(0).toString.dropRight(2).concat("xx")).collect().distinct

In [101]:
wa_zips.take(10)

Array(980xx, 981xx, 982xx)
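The masking step above uses `dropRight(2)`, which gives the right prefix only when every zip code is exactly five characters long. A slightly more defensive variant, sketched here in plain Scala, keeps the first three characters instead. `toZipPrefix` and the sample values are made up for illustration and are not taken from wa_zipcodes.csv.

```scala
// Hypothetical helper: mask a zip code to the "NNNxx" form used by the loan data's zip_code column.
// Taking the first three characters also copes with ZIP+4 values and stray whitespace.
def toZipPrefix(zip: String): String = zip.trim.take(3) + "xx"

// Made-up sample values for illustration.
val sampleZips = Seq("98001", "98101", "98101-1234", " 98201")

// Converting to a Set removes duplicate prefixes.
val prefixes: Set[String] = sampleZips.map(toZipPrefix).toSet
```

The resulting set can be passed to `isin` the same way as the array, for example `isin(prefixes.toSeq: _*)`.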

In [114]:
val res = pq_df.filter($"zip_code".isin(wa_zips:_*))

In [115]:
res.count

9576

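Yes: 9576 loans have a zip_code prefix matching one of the prefixes in wa_zipcodes.csv (980xx, 981xx, 982xx). Because the loan data only records the first three digits, this count may also include loans from outside King County that share those prefixes.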