In [1]:
%load_ext sql
%config SqlMagic.autocommit=False # avoiding the error: FAILED: IllegalStateException COMMIT is not supported yet.
%sql hive://hadoop@localhost:10000/text

In [2]:
# we will use pandas in this exercise 
import pandas as pd

## Create a Mapping: Character -> Frequency

### Bonus Question
Why is it wrong to use the dataset `word_count_gutenberg`?

### Answer

It does not give us the real letter frequencies of the language. In `word_count_gutenberg` every distinct word appears only once, so a very frequent word such as `the` is counted a single time, which under-represents the letters `t`, `h` and `e`.
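To make the effect concrete, here is a minimal sketch with made-up numbers. The dictionary `word_counts` and the helper `letter_shares` are hypothetical and only stand in for a table with one row per distinct word; they are not part of the exercise material.

```python
from collections import Counter

# Hypothetical word counts: "the" occurs five times, "zoo" once.
word_counts = {"the": 5, "zoo": 1}


def letter_shares(weighted):
    """Return the percentage share of each letter.

    weighted=False counts every distinct word once,
    weighted=True counts every occurrence of the word.
    """
    letters = Counter()
    for word, n in word_counts.items():
        weight = n if weighted else 1
        for ch in word:
            letters[ch] += weight
    total = sum(letters.values())
    return {ch: round(100 * c / total, 2) for ch, c in letters.items()}


distinct = letter_shares(weighted=False)  # distinct words only
actual = letter_shares(weighted=True)     # every occurrence counted

print("distinct words :", distinct)
print("all occurrences:", actual)
```

With distinct words only, `o` looks like the most common letter; once occurrences are counted, `t`, `h` and `e` dominate.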

In [3]:
%%sql real_frequency <<

SELECT
    character,
    count(character) as character_count,
    round(100 * count(character)/sum(count(character)) over (), 2) as percentage
FROM (
    SELECT
        explode(split(word,'')) as character
    FROM
        word_gutenberg
    ) chars
WHERE
    character RLIKE '^[a-z]$'
GROUP BY 
    character
ORDER BY 
    character ASC


 * hive://hadoop@localhost:10000/text
Done.
Returning data to local variable real_frequency
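The query splits every word into single characters, keeps only `a` to `z`, counts each character, and divides by the grand total. As a rough illustration of the same logic outside Hive, the sketch below does this in plain Python on a tiny, hypothetical word list (`words` is invented for the example and is not the Gutenberg data).

```python
from collections import Counter

# Hypothetical stand-in for a few rows of the `word` column.
words = ["the", "data", "the", "size"]

# Explode words into characters and keep only a-z (like the RLIKE filter).
counts = Counter(ch for word in words for ch in word if "a" <= ch <= "z")

# Share of each character relative to all counted characters,
# mirroring count(character) / sum(count(character)) over ().
total = sum(counts.values())
frequency = {
    ch: (n, round(100 * n / total, 2))
    for ch, n in sorted(counts.items())  # ORDER BY character ASC
}

for ch, (n, pct) in frequency.items():
    print(ch, n, pct)
```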


## Convert `real_frequency` to a DataFrame

In [6]:
df_real_frequency = real_frequency.DataFrame()

In [7]:
df_real_frequency

Unnamed: 0,character,character_count,percentage
0,a,649718952,8.38
1,b,116527275,1.5
2,c,242972090,3.13
3,d,314688546,4.06
4,e,978448690,12.62
5,f,167275669,2.16
6,g,177223931,2.29
7,h,400875383,5.17
8,i,546780881,7.05
9,j,19052935,0.25


## Most Used Character in the English Language

Given `df_real_frequency`, sort the `DataFrame` based on the `character_count` (`DESC`) to find the most used characters.
- [Help](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)

In [32]:
df_real_frequency.sort_values(by=['character_count'], ascending=False)

Unnamed: 0,character,character_count,percentage
4,e,978448690,12.62
19,t,704615651,9.09
0,a,649718952,8.38
13,n,560411572,7.23
14,o,557575677,7.19
8,i,546780881,7.05
18,s,491275028,6.34
17,r,474360217,6.12
7,h,400875383,5.17
11,l,322452003,4.16


## Loading `cipher.txt` as a Temp Table

In [9]:
! head cipher.txt

gzwfxp
esp bflyetej zq rpypclepo lyo dezcpo olel.
esp dtkp zq esp olel opepcxtypd esp glwfp lyo azepyetlw tydtrse, lyo hspespc te nly mp nzydtopcpo mtr olel zc yze.
esp dtkp zq mtr olel td fdflwwj wlcrpc esly epclmjepd lyo apelmjepd.

glctpej
esp ejap lyo ylefcp zq esp olel.
esp plcwtpc epnsyzwzrtpd wtvp comxdd hpcp nlalmwp ez slyowp decfnefcpo olel pqqtntpyewj lyo pqqpnetgpwj.
szhpgpc, esp nslyrp ty ejap lyo ylefcp qczx decfnefcpo ez dpxt-decfnefcpo zc fydecfnefcpo nslwwpyrpo esp pitdetyr ezzwd lyo epnsyzwzrtpd.
esp mtr olel epnsyzwzrtpd pgzwgpo htes esp actxp tyepyetzy ez nlaefcp, dezcp, lyo acznpdd esp dpxt-decfnefcpo lyo fydecfnefcpo (glctpej) olel rpypclepo htes strs dappo (gpwzntej), lyo sfrp ty dtkp (gzwfxp).


### Creating a Temp Table

In [10]:
%sql CREATE TEMPORARY EXTERNAL TABLE cipher(line string)

 * hive://hadoop@localhost:10000/text
Done.


[]

### Loading Data into the Table

In [12]:
%sql LOAD DATA LOCAL INPATH '/home/hadoop/BDLC_FS23/V04/V04_exercises_material/1_Text_Analysis/cipher.txt' INTO TABLE cipher

 * hive://hadoop@localhost:10000/text
Done.


[]

### Check if it worked

In [13]:
%sql show tables

 * hive://hadoop@localhost:10000/text
Done.


tab_name
cipher
raw_gutenberg
raw_holmes
raw_small
word_count_gutenberg
word_count_holmes
word_gutenberg
word_holmes


In [14]:
%sql select * from cipher limit 5

 * hive://hadoop@localhost:10000/text
Done.


line
gzwfxp
esp bflyetej zq rpypclepo lyo dezcpo olel.
"esp dtkp zq esp olel opepcxtypd esp glwfp lyo azepyetlw tydtrse, lyo hspespc te nly mp nzydtopcpo mtr olel zc yze."
esp dtkp zq mtr olel td fdflwwj wlcrpc esly epclmjepd lyo apelmjepd.


## Frequencies for Cipher

Do the same analysis again. Save the results into `cipher_frequency` and also convert it to a DataFrame.

In [29]:
%%sql

CREATE TEMPORARY TABLE cipher_word AS
SELECT explode(sentence) as word
FROM 
    (
        SELECT
            explode(sentences(line)) as sentence
        FROM
            cipher
    ) t

 * hive://hadoop@localhost:10000/text
Done.


[]

In [30]:
%sql select * from cipher_word limit 3

 * hive://hadoop@localhost:10000/text
Done.


word
gzwfxp
esp
bflyetej


In [31]:
%%sql cipher_frequency <<

SELECT
    character,
    count(character) as character_count,
    round(100 * count(character)/sum(count(character)) over (), 2) as percentage
FROM (
    SELECT
        explode(split(word,'')) as character
    FROM
        cipher_word
    ) chars
WHERE
    character RLIKE '^[a-z]$'
GROUP BY 
    character
ORDER BY 
    character ASC


 * hive://hadoop@localhost:10000/text
Done.
Returning data to local variable cipher_frequency


In [33]:
df_cipher_frequency = cipher_frequency.DataFrame()

In [34]:
df_cipher_frequency.sort_values(by=['character_count'], ascending=False)

Unnamed: 0,character,character_count,percentage
15,p,176,13.46
4,e,141,10.78
11,l,118,9.02
19,t,88,6.73
14,o,84,6.42
24,z,79,6.04
23,y,79,6.04
3,d,77,5.89
2,c,74,5.66
21,w,59,4.51


## Hacking the Code
Now you know the top letter in the English language and the top letter in the cipher. Figure out the ASCII value difference between the two.
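A minimal sketch of that calculation, assuming the cipher is a simple alphabet rotation and taking the two top letters from the sorted tables above (`e` for English, `p` for the cipher). The variable names are chosen for this example only.

```python
ALPHABET_SIZE = 26

top_plain = "e"   # most frequent letter in df_real_frequency
top_cipher = "p"  # most frequent letter in df_cipher_frequency

# How far the cipher moved the letter forward in the alphabet.
encode_shift = (ord(top_cipher) - ord(top_plain)) % ALPHABET_SIZE

# Moving forward by this amount undoes the encoding (wraps around at 'z').
decode_shift = (ALPHABET_SIZE - encode_shift) % ALPHABET_SIZE

print(f"encode shift: {encode_shift}, decode shift: {decode_shift}")
```

Shifting back by the encode shift and shifting forward by the decode shift are equivalent modulo 26.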

In [43]:
import string
for letter in string.ascii_lowercase:
    print(f"the letter {letter} has ascii code {ord(letter)}")

the letter a has ascii code 97
the letter b has ascii code 98
the letter c has ascii code 99
the letter d has ascii code 100
the letter e has ascii code 101
the letter f has ascii code 102
the letter g has ascii code 103
the letter h has ascii code 104
the letter i has ascii code 105
the letter j has ascii code 106
the letter k has ascii code 107
the letter l has ascii code 108
the letter m has ascii code 109
the letter n has ascii code 110
the letter o has ascii code 111
the letter p has ascii code 112
the letter q has ascii code 113
the letter r has ascii code 114
the letter s has ascii code 115
the letter t has ascii code 116
the letter u has ascii code 117
the letter v has ascii code 118
the letter w has ascii code 119
the letter x has ascii code 120
the letter y has ascii code 121
the letter z has ascii code 122
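Relying on a single letter can be fragile for a short text, so it may be worth checking several top-ranked letters at once. The sketch below pairs the five most frequent letters from the two sorted tables above and lets them vote on the shift; it assumes a simple alphabet rotation, and the list names are chosen for this example only.

```python
from collections import Counter

# Top five letters by character_count, copied from the two sorted outputs.
real_top = ["e", "t", "a", "n", "o"]    # df_real_frequency
cipher_top = ["p", "e", "l", "t", "o"]  # df_cipher_frequency

# Shift implied by each rank pair, wrapped into the range 0-25.
shifts = [(ord(c) - ord(r)) % 26 for c, r in zip(cipher_top, real_top)]

# The shift that most rank pairs agree on is the most plausible one.
votes = Counter(shifts)
likely_shift, support = votes.most_common(1)[0]

print(shifts)
print(f"most likely shift: {likely_shift} (supported by {support} of {len(shifts)} pairs)")
```

Ranks further down the list do not have to line up exactly, because a short cipher text only approximates the letter distribution of the language.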


### Write a Decoder

Write a small Python script which reads `cipher.txt` and decodes it.

- Convert letters into their ASCII code with `ord()`.
- Convert an ASCII code back into a character with `chr()`.
- Note: only the letters `a-z` were encoded; other characters such as `,`, ` `, `(` and `)` must not be transformed.
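The three hints can be combined into a helper that handles one character at a time. This is only a sketch: the name `shift_char` is hypothetical, and the default shift of 15 assumes the cipher is an alphabet rotation in which `p` stands for `e`.

```python
def shift_char(c, shift=15):
    """Rotate a lowercase letter by `shift` positions; leave anything else alone."""
    if "a" <= c <= "z":
        base_zero = ord(c) - ord("a")                   # map a-z to 0-25
        return chr((base_zero + shift) % 26 + ord("a"))  # wrap around and map back
    return c  # punctuation, spaces, digits, ... stay unchanged


sample = "esp dtkp, (gzwfxp)"
decoded = "".join(shift_char(c) for c in sample)
print(decoded)
```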

In [59]:
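import string

SHIFT = 15  # cipher 'p' (112) -> plain 'e' (101): -11, i.e. +15 modulo 26

# the context manager closes the file once all lines are processed
with open("./cipher.txt", "r") as cipher:
    for line in cipher:
        decoded_line = []
        for c in line:
            if c in string.ascii_lowercase:
                base_zero = ord(c) - ord("a")  # map a-z to 0-25
                c = chr(((base_zero + SHIFT) % 26) + ord("a"))
            decoded_line.append(c)  # all other characters pass through unchanged
        # each line keeps its trailing newline, so print() adds a blank line
        print("".join(decoded_line))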

volume

the quantity of generated and stored data.

the size of the data determines the value and potential insight, and whether it can be considered big data or not.

the size of big data is usually larger than terabytes and petabytes.



variety

the type and nature of the data.

the earlier technologies like rdbmss were capable to handle structured data efficiently and effectively.

however, the change in type and nature from structured to semi-structured or unstructured challenged the existing tools and technologies.

the big data technologies evolved with the prime intention to capture, store, and process the semi-structured and unstructured (variety) data generated with high speed (velocity), and huge in size (volume).

later, these tools and technologies were explored and used for handling structured data also but preferable for storage.

eventually, the processing of structured data was still kept as optional, either using big data or traditional rdbmss.

this helps in analyzin