## Init Connection

In [None]:
%load_ext sql
%sql hive://hadoop@localhost:10000/text

# Speed

With the magic command `%time`, we can measure how long a statement takes to execute.
Compare the `Wall time` for the three datasets.

Hive needs time to compile the SQL, submit the job, and run the mappers and reducers.
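
As a rough illustration of why `Wall time` is the number to watch, here is a minimal Python sketch that measures elapsed time and CPU time separately. The helper `measure` is hypothetical, and `time.sleep` only stands in for a query whose work happens outside the notebook process (as it does with Hive).

```python
import time

def measure(fn, *args):
    """Run fn(*args) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # CPU time used by this Python process only
    result = fn(*args)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, wall, cpu

# Assumption: sleeping imitates waiting for a remote job to finish.
_, wall, cpu = measure(time.sleep, 0.2)
print(f"Wall time: {wall:.3f} s, CPU time: {cpu:.3f} s")
```

The CPU time stays close to zero while the wall time covers the whole wait, which is the pattern to expect when the notebook is mostly waiting for Hive.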

In [None]:
%time %sql select count(*) from raw_small

In [None]:
%time %sql select count(*) from raw_holmes

In [None]:
%time %sql select count(*) from raw_gutenberg

## Word Count - Step by Step with `raw_holmes`

In [None]:
# get a sneak peek at the data
%sql select * from raw_holmes limit 3

In [None]:
# We have one column called `line`. Let's select only that one
%sql select line from raw_holmes limit 3

In [None]:
# Let's trim the line
%sql select trim(line) from raw_holmes limit 3

In [None]:
# Let's trim the line...
# and give the column the `line` name again
%sql select trim(line) line from raw_holmes limit 3

In [None]:
# The command above requested three lines, but only two contain text because line 2 is empty. Let's filter empty lines out
%sql select trim(line) line from raw_holmes where line != '' limit 3

In [None]:
# One way to get the words is to use the `split` function
%sql select split(trim(line), ' ') words from raw_holmes where line != '' limit 3

In [None]:
# However, there is a built-in `sentences` function which "tokenizes a string of natural language text into words and sentences"
# E.g. the comma in `Holmes,` has been removed in the first line
%sql select sentences(trim(line)) sentences from raw_holmes where line != '' limit 3
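
To make the difference between the two approaches concrete, here is a small Python sketch. It is only an approximation for illustration, not how Hive implements `sentences`; the sample line and the helper `rough_sentences` are made up.

```python
import re

# Hypothetical sample line, not taken from the dataset.
line = "  Holmes, you see. You do not observe!  "

# Splitting on spaces keeps punctuation attached to the words.
naive_words = line.strip().split(" ")

def rough_sentences(text):
    """Return a list of sentences, each a list of words without punctuation."""
    result = []
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[A-Za-z']+", sentence)
        if words:  # skip empty fragments after the last delimiter
            result.append(words)
    return result

nested_words = rough_sentences(line)
print(naive_words)
print(nested_words)
```

The second result is nested (one list per sentence), which mirrors the array of arrays that `sentences` returns.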

In [None]:
# The `sentences` function returns an array of arrays of strings (line -> sentences -> words).
# Let's explode once
%sql select explode(sentences(trim(line))) sentence from raw_holmes where line != '' limit 3

The `sentences` function returns an array of arrays of strings (line -> sentences -> words).
Let's explode once more. Note, however, that Hive does not allow nesting one `explode` call inside another, so we have to use a subquery.
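
The effect of exploding twice can be sketched in plain Python. The generator `explode` below is a hypothetical stand-in that removes exactly one level of nesting per call, and the sample data is invented.

```python
def explode(rows):
    """Yield one output row per element of each array-valued input row."""
    for row in rows:
        for element in row:
            yield element

# Each line is a list of sentences; each sentence is a list of words.
lines = [
    [["to", "sherlock", "holmes"], ["she", "is"]],
    [["the", "woman"]],
]

sentence_rows = list(explode(lines))       # first explode: one row per sentence
word_rows = list(explode(sentence_rows))   # second explode: one row per word
print(sentence_rows)
print(word_rows)
```

The intermediate `sentence_rows` list plays the role of the subquery: the second step needs the output of the first as its input table.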

In [None]:
%%sql
select sentence from (
    select explode(sentences(trim(line))) sentence from raw_holmes where line != ''
) sentence_table limit 4



In [None]:
%%sql
select explode(sentence) word from (
    select explode(sentences(trim(line))) sentence from raw_holmes where line != ''
) sentence_table limit 4



Let us convert the individual words to lowercase with `lower`.

In [None]:
%%sql
select lower(word) as word from (
    select explode(sentence) word from (
        select explode(sentences(trim(line))) sentence from raw_holmes where line != ''
    ) sentence_table
) word_table limit 4

## Saving the result to a new table

In [None]:
%%sql
CREATE TABLE word_holmes 
AS select lower(word) as word from (
    select explode(sentence) word from (
        select explode(sentences(trim(line))) sentence from raw_holmes where line != ''
    ) sentence_table
) word_table

In [None]:
%sql show tables

In [None]:
%sql select * from word_holmes limit 3

# Word Count

The word count is now trivial with `sql` 💫
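
For comparison, the same group, count, sort and limit logic can be sketched in Python with `collections.Counter`. The word list is a made-up sample, not data from `word_holmes`.

```python
from collections import Counter

# Hypothetical sample of already lowercased words.
words = ["the", "woman", "the", "it", "woman", "the"]

# GROUP BY word with count(word)
counts = Counter(words)

# ORDER BY count DESC (ties broken alphabetically), LIMIT 2
top_two = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:2]
print(top_two)
```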

In [None]:
%%sql

SELECT
    word, count(word) as count
FROM
    word_holmes
GROUP BY
    word
ORDER BY
    count DESC
LIMIT 10

We can again save the results to a new table.

In [None]:
%%sql 
CREATE TABLE word_count_holmes 
AS
    SELECT
        word, count(word) as count
    FROM
        word_holmes
    GROUP BY
        word
    ORDER BY
        count DESC

In [None]:
%sql select * from word_count_holmes where word in ('he', 'she', 'it')

# Can you do the same for the `raw_gutenberg` data?

In [None]:
# your solution

# Comparing Gutenberg WordCount with OEC Rank for the Top 20 Words

From Wikipedia: [100 most common words](https://en.wikipedia.org/wiki/Most_common_words_in_English)

Can you compare our findings with the ranks listed here (from Wikipedia)?

|word|place|
| ----------- | ----------- |
|the|1|
|be|2|
|to|3|
|of|4|
|and|5|
|a|6|
|in|7|
|that|8|
|have|9|
|i|10|
|it|11|
|for|12|
|not|13|
|on|14|
|with|15|
|he|16|
|as|17|
|you|18|
|do|19|
|at|20|
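
One possible way to line up the two rankings, sketched in Python under stated assumptions: `rank_words` and `compare` are hypothetical helpers, the counts are invented, and only the first five reference words are used. Keep in mind that the Wikipedia list ranks lemmas (for example, `be` covers `is` and `was`), while our table counts each surface form separately, so the ranks will not match exactly.

```python
def rank_words(counts):
    """Map each word to its rank (1 = most frequent); ties are broken alphabetically."""
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def compare(counts, reference):
    """Return (word, reference_rank, our_rank) rows; our_rank is None if the word is missing."""
    ours = rank_words(counts)
    return [(word, ref_rank, ours.get(word))
            for ref_rank, word in enumerate(reference, start=1)]

# Invented counts for illustration only.
sample_counts = {"the": 50, "and": 30, "to": 25, "of": 20, "holmes": 5}
reference = ["the", "be", "to", "of", "and"]

comparison = compare(sample_counts, reference)
for row in comparison:
    print(row)
```

In the notebook, the counts would come from the word count table instead of a hand-written dictionary.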

In [None]:
# your solution