# Caesar Cipher

Make sure you understand the [Caesar cipher](https://en.wikipedia.org/wiki/Caesar_cipher).

In this exercise we will need:

- HiveQL. Use this [help](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) to find the functions you need in this exercise.
- Pandas
- External and Temporary Tables. [Info 1](https://sparkbyexamples.com/apache-hive/hive-temporary-table-usage-and-how-to-create/) and [Info 2](https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/using-hiveql/content/hive_create_a_hive_temporary_table.html)

In [None]:
%load_ext sql
%config SqlMagic.autocommit=False # avoiding the error: FAILED: IllegalStateException COMMIT is not supported yet.
%sql hive://hadoop@localhost:10000/text

In [None]:
# we will use pandas in this exercise 
import pandas as pd

## The Problem

You have received a cipher, stored in `cipher.txt`:
```bash
!head cipher.txt
```

You are fairly sure that it was encoded with a Caesar cipher.

## Create a Mapping: Character -> Frequency

Using the `word_gutenberg` dataset we would like to create a mapping for each character:


|character|character_count|percentage|
| ----------- | ----------- |----------- |
|e|978448690|8.38|
|...|...|...|
|x|14541510|0.19|

Save this information in the variable `real_frequency`.
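
To make the target shape of the mapping concrete, here is a minimal Python sketch that builds the same three columns from a short string. The sample sentence is hypothetical and only stands in for the `word_gutenberg` data; the exercise itself should still be solved in HiveQL.

```python
from collections import Counter

# Hypothetical sample text; the exercise itself uses the `word_gutenberg` table.
sample = "the quick brown fox jumps over the lazy dog and the eager beaver"

# Count only the lowercase letters a-z, ignoring spaces and punctuation.
counts = Counter(ch for ch in sample.lower() if "a" <= ch <= "z")
total = sum(counts.values())

# One row per character: (character, character_count, percentage).
rows = [
    (ch, n, round(100 * n / total, 2))
    for ch, n in counts.most_common()
]

for character, character_count, percentage in rows[:3]:
    print(character, character_count, percentage)
```

The percentage is each character's count divided by the total number of counted letters, which is the same calculation the query needs to perform.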

### Bonus Question
Why is it wrong to use the dataset `word_count_gutenberg`?

In [None]:
%%sql real_frequency <<

SELECT
...


## Convert `real_frequency` to a DataFrame

In [None]:
df_real_frequency = real_frequency.DataFrame()

In [None]:
df_real_frequency

## Most Used Character in the English Language

Given `df_real_frequency`, sort the `DataFrame` by `character_count` in descending order to find the most frequently used characters.
- [Help](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)
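
As a rough illustration of the sort, the sketch below applies `sort_values` to a small stand-in DataFrame. The name `df_example` is hypothetical, and its two rows reuse the example values from the table earlier in this notebook; your real `df_real_frequency` will have one row per character.

```python
import pandas as pd

# Hypothetical stand-in for df_real_frequency, using the example values shown earlier.
df_example = pd.DataFrame(
    {
        "character": ["x", "e"],
        "character_count": [14541510, 978448690],
        "percentage": [0.19, 8.38],
    }
)

# Sort by count, largest first; reset_index gives a clean 0..n-1 ranking.
df_sorted = df_example.sort_values("character_count", ascending=False).reset_index(drop=True)
print(df_sorted.head())
```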

In [None]:
#sorting in pandas

### Compare our Results with the Official List

Compare our results with the [letter frequency page on Wikipedia](https://en.wikipedia.org/wiki/Letter_frequency).

The top three letters should match!

## Loading `cipher.txt` as a Temp Table

In [None]:
! head cipher.txt

### Creating a Temp Table

In [None]:
%sql CREATE TEMPORARY EXTERNAL TABLE cipher(line string)

### Loading Data into the Table

In [None]:
%sql LOAD DATA LOCAL INPATH '/home/hadoop/BDLC_FS23/V04/V04_exercises_material/1_Text_Analysis/cipher.txt' INTO TABLE cipher

### Check if it worked

In [None]:
%sql show tables

In [None]:
%sql select * from cipher limit 5

## Frequencies for Cipher

Repeat the character frequency analysis for the `cipher` table. Save the result in `cipher_frequency` and convert it to a DataFrame as well.

In [None]:
%%sql cipher_frequency <<

SELECT
   ...

In [None]:
df_cipher_frequency = cipher_frequency.DataFrame()

In [None]:
#sorting in pandas

## Hacking the Code
Now you know the most frequent letter in English and the most frequent letter in the cipher. Work out the difference between their ASCII values.

In [None]:
import string
for letter in string.ascii_lowercase:
    print(f"the letter {letter} has ASCII code {ord(letter)}")
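
The difference between the two top letters can be computed directly with `ord()`. This is a minimal sketch: `cipher_top = "h"` is a hypothetical placeholder, not the actual result for `cipher.txt`, and the modulo assumes the cipher wraps around within `a-z`.

```python
# Most frequent letter in English text, per the frequency analysis.
plain_top = "e"
# Hypothetical value: replace with the top letter you find in the cipher.
cipher_top = "h"

# The Caesar shift is the distance between the two letters, wrapped to 0..25.
shift = (ord(cipher_top) - ord(plain_top)) % 26
print(f"estimated shift: {shift}")
```

Taking the result modulo 26 keeps the shift positive even when the cipher letter comes earlier in the alphabet than the plain letter.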

### Write a Decoder

Write a small Python script that reads `cipher.txt` and decodes it.

- Convert letters to ASCII codes with `ord()`.
- Convert ASCII codes back to characters with `chr()`.
- Note that only the letters `a-z` have been shifted; other characters such as `,`, ` `, `(` and `)` should not be transformed.
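
One possible way to combine these hints is sketched below. The helper names `shift_letter` and `decode` are hypothetical, and the sketch assumes the cipher shifted letters forward and wrapped around within `a-z`; check both assumptions against your own frequency results.

```python
import string


def shift_letter(ch: str, shift: int) -> str:
    """Shift a lowercase letter back by `shift` places; leave anything else as is."""
    if ch in string.ascii_lowercase:
        return chr((ord(ch) - ord("a") - shift) % 26 + ord("a"))
    return ch


def decode(text: str, shift: int) -> str:
    """Decode a whole string one character at a time."""
    return "".join(shift_letter(ch, shift) for ch in text)


# Hypothetical example: "khoor, zruog" is "hello, world" shifted forward by 3.
print(decode("khoor, zruog", 3))
```

Subtracting `ord("a")` before the modulo maps the letters to 0..25, so a letter near the start of the alphabet wraps around to the end instead of falling out of the `a-z` range.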

In [None]:
import string


with open("./cipher.txt", "r") as cipher:
    for line in cipher:
        ...  # TODO: decode the line and print it
