## Spark Developer Training

**Manaranjan Pradhan**<br/>
**manaranjan@enablecloud.com**<br/>
*This notebook is given to participants as part of Spark Training. Forwarding it to others is strictly prohibited.*

# Lab: Working with text data

### Parsing weblogs with regular expressions to create a table

* Original Format: %s %s %s [%s] \"%s %s HTTP/1.1\" %s %s
* Example Web Log Row 
 * 10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] "GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288

In [0]:
%fs ls /FileStore/tables/logs

path,name,size
dbfs:/FileStore/tables/logs/apache_access.log,apache_access.log,160971


## Create External Table
Create an external table over the weblog data, with a regular expression for the log format defined as part of the serializer/deserializer (SerDe) definition. The table definition then does the parsing, so no separate ETL logic is needed.

In [0]:
%sql
DROP TABLE IF EXISTS weblog;
CREATE EXTERNAL TABLE weblog (
  ipaddress STRING,
  clientidentd STRING,
  userid STRING,
  datetime STRING,
  method STRING,
  endpoint STRING,
  protocol STRING,
  responseCode INT,
  contentSize BIGINT
)
ROW FORMAT
  SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \\"(\\S+) (\\S+) (\\S+)\\" (\\d{3}) (\\d+)'
)
LOCATION 
  "/FileStore/tables/logs"
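
To see what the `input.regex` pattern captures, it can help to try it outside Spark first. The following is a minimal sketch in plain Python, assuming the same pattern with the SQL-level doubled backslashes removed; the names `LOG_PATTERN`, `COLUMNS`, and `parse_log_line` are illustrative and not part of the lab. It is applied to the example row from the top of this notebook.

```python
import re

# Same pattern as the table's "input.regex", written as a Python raw string
# (the doubled backslashes in the SQL literal become single backslashes here).
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)'
)

# Column names in the order the capture groups appear in the pattern.
COLUMNS = ["ipaddress", "clientidentd", "userid", "datetime", "method",
           "endpoint", "protocol", "responseCode", "contentSize"]

def parse_log_line(line):
    """Return a dict of column -> captured string, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))

sample = ('10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] '
          '"GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288')
row = parse_log_line(sample)
print(row)
```

Every captured value is a string at this stage. In the table, `responseCode` and `contentSize` are declared as `INT` and `BIGINT`, so the conversion happens when the table is read.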

In [0]:
weblog_df = spark.read.table("weblog")

In [0]:
weblog_df.cache()

In [0]:
weblog_df.show(5)

#### Note: You can run a CACHE TABLE statement to speed up queries against a table you use regularly.

In [0]:
%sql
CACHE TABLE weblog;

## Query your weblogs using Spark SQL
You do not need to parse out the datetime, method, endpoint, and protocol columns yourself, because the external table has already done this for you. You can now treat the weblog data like any other structured dataset and write Spark SQL against the table.

In [0]:
%sql
select * from weblog limit 10;

ipaddress,clientidentd,userid,datetime,method,endpoint,protocol,responseCode,contentSize
64.242.88.10,-,-,07/Mar/2004:16:05:49 -0800,GET,/twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables,HTTP/1.1,401,12846
64.242.88.10,-,-,07/Mar/2004:16:06:51 -0800,GET,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2,HTTP/1.1,200,4523
64.242.88.10,-,-,07/Mar/2004:16:10:02 -0800,GET,/mailman/listinfo/hsdivision,HTTP/1.1,200,6291
64.242.88.10,-,-,07/Mar/2004:16:11:58 -0800,GET,/twiki/bin/view/TWiki/WikiSyntax,HTTP/1.1,200,7352
64.242.88.10,-,-,07/Mar/2004:16:20:55 -0800,GET,/twiki/bin/view/Main/DCCAndPostFix,HTTP/1.1,200,5253
64.242.88.10,-,-,07/Mar/2004:16:23:12 -0800,GET,/twiki/bin/oops/TWiki/AppendixFileSystem?template=oopsmore&param1=1.12&param2=1.12,HTTP/1.1,200,11382
64.242.88.10,-,-,07/Mar/2004:16:24:16 -0800,GET,/twiki/bin/view/Main/PeterThoeny,HTTP/1.1,200,4924
64.242.88.10,-,-,07/Mar/2004:16:29:16 -0800,GET,/twiki/bin/edit/Main/Header_checks?topicparent=Main.ConfigurationVariables,HTTP/1.1,401,12851
64.242.88.10,-,-,07/Mar/2004:16:30:29 -0800,GET,/twiki/bin/attach/Main/OfficeLocations,HTTP/1.1,401,12851
64.242.88.10,-,-,07/Mar/2004:16:31:48 -0800,GET,/twiki/bin/view/TWiki/WebTopicEditTemplate,HTTP/1.1,200,3732
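
The `datetime` column is stored as a `STRING`, so time-based analysis needs a conversion step. The following is a minimal sketch in plain Python of how one value from the result above could be parsed; the format string and the helper name `parse_log_time` are assumptions made for illustration, not part of the lab.

```python
from datetime import datetime

# Layout of the datetime column, e.g. "07/Mar/2004:16:05:49 -0800".
# %b matches English month abbreviations under the default C locale.
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def parse_log_time(value):
    """Convert a weblog datetime string into a timezone-aware datetime."""
    return datetime.strptime(value, LOG_TIME_FORMAT)

ts = parse_log_time("07/Mar/2004:16:05:49 -0800")
print(ts.isoformat())
```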


## Enhanced Spark SQL queries
At this point, we can quickly write SQL GROUP BY statements to find which web page in the logs has the most events. But notice that there is a hierarchy of pages within the endpoint column. We just want to understand the top level of the hierarchy: which areas, such as the Olympics, Cascades, or Rainier, are more popular.

In [0]:
%sql
select endpoint, count(1) as Events
  from weblog 
 group by endpoint
 order by Events desc 

endpoint,Events
/twiki/bin/view/Main/WebHome,40
/twiki/pub/TWiki/TWikiLogos/twikiRobot46x50.gif,32
/,31
/favicon.ico,28
/robots.txt,27
/razor.html,23
/twiki/bin/view/Main/SpamAssassinTaggingOnly,18
/twiki/bin/view/Main/SpamAssassinAndPostFix,17
/cgi-bin/mailgraph.cgi/mailgraph_0.png,16
/cgi-bin/mailgraph.cgi/mailgraph_1.png,16
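
One way to roll these endpoints up to their top-level area is to keep only the first path segment and sum the events per segment. The following is a minimal sketch in plain Python, using the ten rows from the query result above as input; the helper `top_level` is a hypothetical name introduced here, and the totals cover only those ten rows, not the full log.

```python
from collections import Counter

def top_level(endpoint):
    """Return the first path segment of an endpoint, ignoring any query string."""
    path = endpoint.split("?", 1)[0]
    segments = [s for s in path.split("/") if s]
    return "/" + segments[0] if segments else "/"

# (endpoint, Events) pairs copied from the query result above.
endpoint_counts = [
    ("/twiki/bin/view/Main/WebHome", 40),
    ("/twiki/pub/TWiki/TWikiLogos/twikiRobot46x50.gif", 32),
    ("/", 31),
    ("/favicon.ico", 28),
    ("/robots.txt", 27),
    ("/razor.html", 23),
    ("/twiki/bin/view/Main/SpamAssassinTaggingOnly", 18),
    ("/twiki/bin/view/Main/SpamAssassinAndPostFix", 17),
    ("/cgi-bin/mailgraph.cgi/mailgraph_0.png", 16),
    ("/cgi-bin/mailgraph.cgi/mailgraph_1.png", 16),
]

# Sum the events for each top-level segment.
totals = Counter()
for endpoint, events in endpoint_counts:
    totals[top_level(endpoint)] += events

for area, events in totals.most_common():
    print(area, events)
```

In Spark itself, the same grouping could likely be expressed with the built-in `regexp_extract` function on the `endpoint` column followed by a GROUP BY, which avoids pulling rows out of the cluster.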
