# Problem 2 - Working with SparkSQL

Do not insert any cells beyond the ones provided.

Create your SparkContext and SparkSession:

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import SparkSession
spark = SparkSession.builder \
     .appName("Test SparkSession") \
     .getOrCreate()

## Quazyilx again!

Yes, you remember it. As a reminder, here is the description of the files.

The quazyilx has been malfunctioning, and occasionally generates output with a `-1` for all four measurements, like this:

    2015-12-10T08:40:10Z fnard:-1 fnok:-1 cark:-1 gnuck:-1

There are four versions of the _quazyilx_ file, each a different size. As the listing below shows, the file sizes are 50MB (1,000,000 rows), 4.8GB (100,000,000 rows), 18GB (369,865,098 rows) and 36.7GB (752,981,134 rows). The files differ only in the number of records; the file structure is the same.

```
[hadoop@ip-172-31-1-240 ~]$ hadoop fs -ls s3://bigdatateaching/quazyilx/
Found 4 items
-rw-rw-rw-   1 hadoop hadoop    52443735 2018-01-25 15:37 s3://bigdatateaching/quazyilx/quazyilx0.txt
-rw-rw-rw-   1 hadoop hadoop  5244417004 2018-01-25 15:37 s3://bigdatateaching/quazyilx/quazyilx1.txt
-rw-rw-rw-   1 hadoop hadoop 19397230888 2018-01-25 15:38 s3://bigdatateaching/quazyilx/quazyilx2.txt
-rw-rw-rw-   1 hadoop hadoop 39489364082 2018-01-25 15:41 s3://bigdatateaching/quazyilx/quazyilx3.txt
```
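
As a rough sanity check that the four files share one structure, the byte sizes in the listing can be divided by the row counts quoted above. This is only a sketch: the dictionary name `files` is made up for illustration, and the numbers are copied from the listing and the text.

```python
# (size in bytes from the listing, row count from the description)
files = {
    "quazyilx0.txt": (52443735, 1_000_000),
    "quazyilx1.txt": (5244417004, 100_000_000),
    "quazyilx2.txt": (19397230888, 369_865_098),
    "quazyilx3.txt": (39489364082, 752_981_134),
}

# Average line length, newline included, for each file
bytes_per_row = {name: size / rows for name, (size, rows) in files.items()}

for name, value in bytes_per_row.items():
    print(f"{name}: {value:.2f} bytes per row")
```

Every file comes out at roughly 52.4 bytes per row, which is consistent with the files differing only in length.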

You will use Spark and SparkSQL to create a Spark DataFrame and then run some analysis on the files using SparkSQL queries.

First, in the following cell, create an RDD called `quazyilx` that reads the `quazyilx1.txt` file from S3.

In [2]:
quazyilx = sc.textFile("s3://bigdatateaching/quazyilx/quazyilx1.txt")

In the next cell, look at the first 50 elements of `quazyilx` to make sure everything is working correctly. This should take a few seconds.

In [3]:
quazyilx.take(50)

['2000-01-01 00:00:03 fnard:7 fnok:8 cark:19 gnuck:25',
 '2000-01-01 00:00:08 fnard:14 fnok:19 cark:16 gnuck:37',
 '2000-01-01 00:00:17 fnard:12 fnok:11 cark:12 gnuck:8',
 '2000-01-01 00:00:22 fnard:18 fnok:16 cark:3 gnuck:8',
 '2000-01-01 00:00:32 fnard:7 fnok:16 cark:7 gnuck:37',
 '2000-01-01 00:00:40 fnard:6 fnok:14 cark:3 gnuck:30',
 '2000-01-01 00:00:47 fnard:11 fnok:10 cark:17 gnuck:7',
 '2000-01-01 00:00:55 fnard:9 fnok:14 cark:13 gnuck:30',
 '2000-01-01 00:00:56 fnard:10 fnok:1 cark:7 gnuck:6',
 '2000-01-01 00:00:59 fnard:11 fnok:11 cark:12 gnuck:18',
 '2000-01-01 00:01:03 fnard:9 fnok:13 cark:14 gnuck:49',
 '2000-01-01 00:01:06 fnard:12 fnok:10 cark:19 gnuck:30',
 '2000-01-01 00:01:16 fnard:0 fnok:12 cark:19 gnuck:26',
 '2000-01-01 00:01:26 fnard:10 fnok:11 cark:10 gnuck:49',
 '2000-01-01 00:01:30 fnard:9 fnok:5 cark:16 gnuck:13',
 '2000-01-01 00:01:38 fnard:11 fnok:10 cark:7 gnuck:47',
 '2000-01-01 00:01:43 fnard:2 fnok:2 cark:20 gnuck:35',
 '2000-01-01 00:01:53 fnard:12 fnok

You will now need to work with the RDD to be able to make a DataFrame. In the following cell, create a Python class called `quazyilx_class` that processes a line and exposes the attributes `.time`, `.fnard`, `.fnok`, `.cark` and `.gnuck`.

You will need to define the Regular Expression and complete the class where it says `#Put your code here:`

In [33]:
import sys
import os,datetime,re

# Raw string so that \d reaches the regex engine unchanged
QUAZYILX_RE = r"(....-..-.. ..:..:..) fnard:(-?[\d]+) fnok:(-?[\d]+) cark:(-?[\d]+) gnuck:(-?[\d]+)"
quazyilx_re = re.compile(QUAZYILX_RE)

class quazyilx_class():
    def __init__(self,line):
        # value is None when the line does not match the expected format
        value = quazyilx_re.match(line)
        try:
            self.time = str(value.group(1))
            self.fnard = int(value.group(2))
            self.fnok = int(value.group(3))
            self.cark = int(value.group(4))
            self.gnuck = int(value.group(5))
        except AttributeError:
            # No match: mark every field as missing
            self.time = None
            self.fnard = None
            self.fnok = None
            self.cark = None
            self.gnuck = None
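
Before mapping over the whole RDD, it can help to try the pattern on a few lines locally. The sketch below reuses the pattern from the cell above; the helper `parse` and the list `samples` are hypothetical names, the first sample is copied from the `take(50)` output, and the other two are constructed for illustration.

```python
import re

# Same pattern as the cell above, written as a raw string
PATTERN = re.compile(
    r"(....-..-.. ..:..:..) fnard:(-?[\d]+) fnok:(-?[\d]+) cark:(-?[\d]+) gnuck:(-?[\d]+)"
)

samples = [
    "2000-01-01 00:00:03 fnard:7 fnok:8 cark:19 gnuck:25",    # normal reading
    "2000-01-28 03:07:44 fnard:-1 fnok:-1 cark:-1 gnuck:-1",  # malfunction (constructed)
    "not a quazyilx line",                                    # malformed (constructed)
]

def parse(line):
    """Return (time, fnard, fnok, cark, gnuck), or None if the line does not match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    time, *values = m.groups()
    return (time, *map(int, values))

parsed = [parse(s) for s in samples]
for p in parsed:
    print(p)
```

The `-?` in each group is what lets the `-1` readings through; without it the malfunction lines would fail to match.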

You will then need to turn the quazyilx RDD into a `Row()` object. You can do that with a lambda function, like this:

```python
lambda q:Row(datetime=q.datetime.isoformat(),fnard=q.fnard,fnok=q.fnok,cark=q.cark,gnuck=q.gnuck)
```

Alternatively, you can add a new method called `.Row()` to the `quazyilx_class` class that returns a Row. These approaches are more or less equivalent; you just need to pick one of them. You may find it useful to look at [this documentation](http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection).
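
A minimal sketch of the `.Row()` method approach is shown here. It is an illustration under assumptions, not the expected solution: the class name `QuazyilxRecord` and the stricter `\d`-based pattern are hypothetical, and the `SimpleNamespace` fallback exists only so the sketch can run on a machine without Spark installed.

```python
import re
from datetime import datetime

try:
    from pyspark.sql import Row
except ImportError:
    # Assumption: stand-in with keyword construction and attribute access,
    # used only when pyspark is not available
    from types import SimpleNamespace as Row

LINE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"fnard:(-?\d+) fnok:(-?\d+) cark:(-?\d+) gnuck:(-?\d+)"
)

class QuazyilxRecord:  # hypothetical name
    def __init__(self, line):
        m = LINE_RE.match(line)
        if m is None:
            raise ValueError(f"unparsable line: {line!r}")
        self.datetime = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        self.fnard, self.fnok, self.cark, self.gnuck = (int(g) for g in m.groups()[1:])

    def Row(self):
        # Inside the method, the name Row resolves to the imported class
        return Row(datetime=self.datetime.isoformat(), fnard=self.fnard,
                   fnok=self.fnok, cark=self.cark, gnuck=self.gnuck)

record = QuazyilxRecord("2000-01-01 00:00:03 fnard:7 fnok:8 cark:19 gnuck:25")
row = record.Row()
print(row)
```

With a class like this, the RDD conversion would reduce to something like `quazyilx.map(lambda l: QuazyilxRecord(l).Row())`, so each line is parsed only once.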

In the next cell, create an RDD called `line` that converts the `quazyilx` RDD into a `Row()` object using the `quazyilx_class`.

In [43]:
from pyspark.sql import Row

def to_row(text):
    # Parse each line once instead of once per field
    q = quazyilx_class(text)
    return Row(time = datetime.datetime.strptime(q.time,"%Y-%m-%d %H:%M:%S").isoformat(),
               fnard = q.fnard,
               fnok = q.fnok,
               cark = q.cark,
               gnuck = q.gnuck)

line = quazyilx.map(to_row)

Look at the first 10 rows to make sure everything is working.

In [44]:
line.take(10)

[Row(cark=19, fnard=7, fnok=8, gnuck=25, time='2000-01-01T00:00:03'),
 Row(cark=16, fnard=14, fnok=19, gnuck=37, time='2000-01-01T00:00:08'),
 Row(cark=12, fnard=12, fnok=11, gnuck=8, time='2000-01-01T00:00:17'),
 Row(cark=3, fnard=18, fnok=16, gnuck=8, time='2000-01-01T00:00:22'),
 Row(cark=7, fnard=7, fnok=16, gnuck=37, time='2000-01-01T00:00:32'),
 Row(cark=3, fnard=6, fnok=14, gnuck=30, time='2000-01-01T00:00:40'),
 Row(cark=17, fnard=11, fnok=10, gnuck=7, time='2000-01-01T00:00:47'),
 Row(cark=13, fnard=9, fnok=14, gnuck=30, time='2000-01-01T00:00:55'),
 Row(cark=7, fnard=10, fnok=1, gnuck=6, time='2000-01-01T00:00:56'),
 Row(cark=12, fnard=11, fnok=11, gnuck=18, time='2000-01-01T00:00:59')]

In the following cell, convert the `line` RDD into a DataFrame called `quazyilx_df` using the `spark.createDataFrame` method, and register it as the SQL table `quazyilx_tbl` with the `.createOrReplaceTempView` method. You will want to cache the DataFrame so that it is not regenerated every time you run a query.

In [45]:
quazyilx_df = spark.createDataFrame(line)
quazyilx_df.createOrReplaceTempView("quazyilx_tbl")
quazyilx_df.cache()
quazyilx_df.show()

+----+-----+----+-----+-------------------+
|cark|fnard|fnok|gnuck|               time|
+----+-----+----+-----+-------------------+
|  19|    7|   8|   25|2000-01-01T00:00:03|
|  16|   14|  19|   37|2000-01-01T00:00:08|
|  12|   12|  11|    8|2000-01-01T00:00:17|
|   3|   18|  16|    8|2000-01-01T00:00:22|
|   7|    7|  16|   37|2000-01-01T00:00:32|
|   3|    6|  14|   30|2000-01-01T00:00:40|
|  17|   11|  10|    7|2000-01-01T00:00:47|
|  13|    9|  14|   30|2000-01-01T00:00:55|
|   7|   10|   1|    6|2000-01-01T00:00:56|
|  12|   11|  11|   18|2000-01-01T00:00:59|
|  14|    9|  13|   49|2000-01-01T00:01:03|
|  19|   12|  10|   30|2000-01-01T00:01:06|
|  19|    0|  12|   26|2000-01-01T00:01:16|
|  10|   10|  11|   49|2000-01-01T00:01:26|
|  16|    9|   5|   13|2000-01-01T00:01:30|
|   7|   11|  10|   47|2000-01-01T00:01:38|
|  20|    2|   2|   35|2000-01-01T00:01:43|
|  20|   12|  11|    3|2000-01-01T00:01:53|
|  16|    6|   6|   18|2000-01-01T00:01:54|
|  17|    9|  10|   15|2000-01-0

Once you create and register the dataframe and table, you will run SQL queries using `spark.sql()` to calculate the following:

1. The number of rows in the dataset
1. The number of lines that have -1 for `fnard`, `fnok`, `cark` and `gnuck`.
1. The number of lines that have -1 for `fnard` but have `fnok > 5` and `cark > 5`
1. The first datetime in the dataset
1. The first datetime that has -1 for all of the values
1. The last datetime in the dataset
1. The last datetime that has a -1 for all of the values

Place one query in each of the following seven cells and run it to get the results. Remember, running the query statement itself will not give you the results you want. You need to do something else to "get" the result.

**Note: in development, the first query may take approximately 10-15 minutes to run with the cluster configuration for this assignment (1 master, 4 task nodes of m4.xlarge). If you cache() correctly, all subsequent queries should take no more than 5 seconds.**
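
Because the first query on the cluster is slow, the SQL logic can be tried on a handful of rows first. The sketch below uses Python's built-in `sqlite3` as a stand-in for Spark SQL; it is not Spark, and only standard SQL shared by both is used. The rows are made up for illustration, and the helper `scalar` is a hypothetical name.

```python
import sqlite3

# Tiny hand-made sample (hypothetical rows, not taken from the real files)
rows = [
    ("2000-01-01T00:00:03", 7, 8, 19, 25),
    ("2000-01-01T00:00:08", -1, -1, -1, -1),
    ("2000-01-01T00:00:17", -1, 11, 12, 8),
    ("2000-01-01T00:00:22", -1, -1, -1, -1),
    ("2000-01-01T00:00:32", 7, 16, 7, 37),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quazyilx_tbl "
    "(time TEXT, fnard INTEGER, fnok INTEGER, cark INTEGER, gnuck INTEGER)"
)
conn.executemany("INSERT INTO quazyilx_tbl VALUES (?, ?, ?, ?, ?)", rows)

ALL_BAD = "fnard = -1 AND fnok = -1 AND cark = -1 AND gnuck = -1"

def scalar(sql):
    """Run a query and fetch its single value; executing alone returns no data."""
    return conn.execute(sql).fetchone()[0]

total = scalar("SELECT count(*) FROM quazyilx_tbl")
all_bad = scalar(f"SELECT count(*) FROM quazyilx_tbl WHERE {ALL_BAD}")
partial = scalar(
    "SELECT count(*) FROM quazyilx_tbl WHERE fnard = -1 AND fnok > 5 AND cark > 5"
)
first_bad = scalar(f"SELECT min(time) FROM quazyilx_tbl WHERE {ALL_BAD}")
last_bad = scalar(f"SELECT max(time) FROM quazyilx_tbl WHERE {ALL_BAD}")

print(total, all_bad, partial, first_bad, last_bad)
```

The same split applies in Spark: `spark.sql(...)` only builds a DataFrame, and a call such as `.show()` or `.collect()` is what actually produces the result.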


In [46]:
# The number of rows in the dataset
spark.sql("select count(*) from quazyilx_tbl").show()

+---------+
| count(1)|
+---------+
|100000000|
+---------+



In [47]:
# Number of lines that have -1 for fnard, fnok, cark and gnuck
spark.sql("""select count(*) 
             from quazyilx_tbl 
             where fnard = -1 
             AND fnok = -1 
             AND cark = -1 
             AND gnuck =-1""").show()

+--------+
|count(1)|
+--------+
|     190|
+--------+



In [48]:
# Number of lines that have -1 for fnard but have fnok>5, cark>5
spark.sql("""select count(*) from quazyilx_tbl where fnard=-1 and fnok>5 and cark>5""").show()

+--------+
|count(1)|
+--------+
| 2114009|
+--------+



In [49]:
# first datetime in the dataset
spark.sql("""select min(time) as first_date from quazyilx_tbl """).show()

+-------------------+
|         first_date|
+-------------------+
|2000-01-01T00:00:03|
+-------------------+
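
Taking `min(time)` works here even though `time` is a string column, because fixed-width ISO 8601 timestamps sort lexicographically in chronological order. A small sketch of that property, using timestamps that appear in this notebook's outputs (the variable names are illustrative):

```python
from datetime import datetime

stamps = [
    "2000-01-28T03:07:44",
    "2017-06-05T18:03:07",
    "2000-01-01T00:00:03",
    "2017-04-21T04:57:10",
]

# Sorting as plain strings and sorting as parsed datetimes give the same order
by_string = sorted(stamps)
by_datetime = sorted(stamps, key=datetime.fromisoformat)

earliest = min(stamps)
latest = max(stamps)
print(by_string == by_datetime, earliest, latest)
```

This only holds when every timestamp uses the same zero-padded format and the same time zone.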



In [50]:
# first datetime that has -1 for all values
spark.sql("""select * 
             from quazyilx_tbl
             where cark=-1
             AND fnard=-1
             AND fnok=-1
             AND gnuck=-1
             order by time ASC
             limit 1""").show()

+----+-----+----+-----+-------------------+
|cark|fnard|fnok|gnuck|               time|
+----+-----+----+-----+-------------------+
|  -1|   -1|  -1|   -1|2000-01-28T03:07:44|
+----+-----+----+-----+-------------------+



In [51]:
# last datetime in the dataset
spark.sql("""select max(time) as last_date from quazyilx_tbl """).show()

+-------------------+
|          last_date|
+-------------------+
|2017-06-05T18:03:07|
+-------------------+



In [52]:
# last datetime that has -1 for all values
spark.sql("""select * 
             from quazyilx_tbl
             where cark=-1
             AND fnard=-1
             AND fnok=-1
             AND gnuck=-1
             order by time DESC
             limit 1""").show()

+----+-----+----+-----+-------------------+
|cark|fnard|fnok|gnuck|               time|
+----+-----+----+-----+-------------------+
|  -1|   -1|  -1|   -1|2017-04-21T04:57:10|
+----+-----+----+-----+-------------------+



When you finish this problem, click File -> 'Save and Checkpoint' in the menu bar to make sure that the latest version of the notebook file is saved. Before you close this notebook and move on, stop your SparkContext; otherwise you will not be able to re-allocate resources. Remember, you will commit the .ipynb file to the repository for submission (from the master node terminal).

In [None]:
sc.stop()