# Lab 0.4 File IO

## Objective

1. Read information from files using Python
2. Use regular expressions to extract information from text
3. Create files using Python

*The challenge section and "just for fun" section are optional.*

## Rubric

- 6 pts - Contains all required components and uses professional language
- 5 pts - Contains all required components, but uses unprofessional language, formatting, etc.
- 4 pts - Contains some, but not all, of the required components
- 3 pts - Did not submit

## Part 1: Letter Frequency

A Caesar cipher, or shift cipher, is one of the simplest encryption techniques. It is named after Julius Caesar, who used it to send private messages. To encrypt a message with a Caesar cipher, each letter of your plaintext is replaced by the letter a fixed number of positions away in the alphabet; the result is your ciphertext.

For example, if I wanted to encrypt the message `ECHO` using a left shift of 3, I would rewrite each character by shifting the entire alphabet left by 3 characters. Using the chart and key below, we can see that `E -> B`, `C -> Z`, `H -> E`, and `O -> L`. So `ECHO` becomes `BZEL`.

![Pasted image 20231227102315](https://github.com/gormes-EPIC/FileIO-CSV-DSF/assets/134316348/36015604-5669-475c-a8c6-3d4674da98d4)
- Plaintext:  ABCDEFGHIJKLMNOPQRSTUVWXYZ
- Ciphertext: XYZABCDEFGHIJKLMNOPQRSTUVW

We can use the same cipher to encrypt the plaintext `THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG` as the ciphertext `QEB NRFZH YOLTK CLU GRJMP LSBO QEB IXWV ALD`. To decrypt it, apply the key in the other direction by shifting each letter right by 3.
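The shift described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lab's required solution; the function names `shift_left` and `shift_right` are hypothetical, and the sketch assumes only the letters A-Z are shifted while everything else is left alone.

```python
import string

ALPHABET = string.ascii_uppercase


def shift_left(text, shift):
    """Replace each letter A-Z with the letter `shift` positions earlier, wrapping around."""
    result = []
    for char in text.upper():
        if char in ALPHABET:
            result.append(ALPHABET[(ALPHABET.index(char) - shift) % 26])
        else:
            result.append(char)  # leave spaces and punctuation unchanged
    return "".join(result)


def shift_right(text, shift):
    """Undo a left shift by moving each letter `shift` positions later."""
    return shift_left(text, -shift)


print(shift_left("ECHO", 3))   # BZEL
print(shift_right("BZEL", 3))  # ECHO
```

The `% 26` is what makes the alphabet wrap around, so that `C` shifted left by 3 becomes `Z`.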

As long as whoever is reading the message knows you have shifted the alphabet left by 3, it is straightforward to decrypt `BZEL` as `ECHO`. But what if you intercepted this message and didn't know the original shift? By exploiting patterns in the English language, we can actually decrypt Caesar ciphers without knowing the original shift. [Source](https://www.101computing.net/caesar-cipher/)


### Your Task

One way to break a Caesar cipher is to look at the frequency of the letters. In a typical English text, some letters are much more frequent than others.

To create your frequency table you will:

1. Using [Project Gutenberg](https://www.gutenberg.org/), download at least one book into your directory. *Hint: Once you navigate to a book, copy the URL of the Plain Text UTF-8 download and use the `wget` command in your terminal.*
2. Open your book using Python, count each of the letters, and create a frequency table.
3. After you are done, print out the information.

#### Example Output

```
A: 1023
B: 356
C: 40
...
```



```python
# Read every line of the King James Bible
with open("King_James_Bible.txt", "r", encoding="utf-8") as bible:
    lines = bible.readlines()

# Start every letter A-Z at a count of zero
frequency_table = {
    'A': 0, 'B': 0, 'C': 0, 'D': 0, 'E': 0, 'F': 0, 'G': 0, 'H': 0,
    'I': 0, 'J': 0, 'K': 0, 'L': 0, 'M': 0, 'N': 0, 'O': 0, 'P': 0,
    'Q': 0, 'R': 0, 'S': 0, 'T': 0, 'U': 0, 'V': 0, 'W': 0, 'X': 0, 'Y': 0, 'Z': 0
}

# Count each letter, ignoring case
for line in lines:
    for char in line:
        letter = char.upper()
        # isalpha() is also True for accented letters, so only count A-Z
        if letter in frequency_table:
            frequency_table[letter] += 1

# Print the frequency table
for letter, occurrences in frequency_table.items():
    print(f"{letter}: {occurrences}")
```

```
A: 275833
B: 48899
C: 55109
D: 158123
E: 412518
F: 83601
G: 55326
H: 282825
I: 194024
J: 8903
K: 22326
L: 130021
M: 79904
N: 225131
O: 243399
P: 43313
Q: 964
R: 170390
S: 190150
T: 317924
U: 83508
V: 30372
W: 65508
X: 1479
Y: 58595
Z: 2975
```
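
Raw counts depend on the length of the book, so they are hard to compare against a short ciphertext. One option is to convert the counts into percentages. The sketch below assumes a hypothetical helper named `letter_percentages` (it is not part of the lab) and uses `collections.Counter` from the standard library to do the counting.

```python
from collections import Counter
import string


def letter_percentages(text):
    """Return {letter: percent of all A-Z letters} for the letters present in text."""
    counts = Counter(char for char in text.upper() if char in string.ascii_uppercase)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the letters from most to least frequent
    return {letter: 100 * count / total for letter, count in counts.most_common()}


sample = "The quick brown fox jumps over the lazy dog"
for letter, percent in list(letter_percentages(sample).items())[:3]:
    print(f"{letter}: {percent:.1f}%")
```

Percentages from a whole book and from a short message can then be compared side by side, even though the totals are very different.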


### Just for Fun! Break this Caesar Cipher

Decode the following ciphertext. Start by matching the most frequent letters in the ciphertext with the most frequent letters in the table you just made. *Tip: In addition to using your letter frequency table, look carefully at the one- and two-letter words; there are only a few words they could be. Also, try to identify frequently used words like `THE` or `AND` in your ciphertext.*

Ciphertext:

```

PA PZ H WLYPVK VM JPCPS DHY. YLILS ZWHJLZOPWZ, ZAYPRPUN MYVT H OPKKLU IHZL, OHCL DVU AOLPY MPYZA CPJAVYF HNHPUZA AOL LCPS NHSHJAPJ LTWPYL. KBYPUN AOL IHAASL, YLILS ZWPLZ THUHNLK AV ZALHS ZLJYLA WSHUZ AV AOL LTWLYVY'Z BSAPTHAL DLHWVU, AOL KLHAO ZAHY, HU HYTVYLK ZWHJL ZAHAPVU DPAO LUVBNO WVDLY AV KLZAYVF HU LUAPYL WSHULA. WBYZBLK IF AOL LTWLYVY'Z ZPUPZALY HNLUAZ, WYPUJLZZ SLPH YHJLZ OVTL HIVHYK OLY ZAHYZOPW, JBZAVKPHU VM AOL ZAVSLU WSHUZ AOHA JHU ZHCL OLY WLVWSL HUK YLZAVYL MYLLKVT AV AOL NHSHEF ....

```
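
Because a Caesar cipher shifts every letter by the same amount, identifying a single letter is enough to recover the whole key. The sketch below is one possible approach under a stated assumption: that the most common letter in the ciphertext stands for `E`. The function names and the test message are hypothetical, and the guess can be wrong for short or unusual texts.

```python
from collections import Counter
import string

ALPHABET = string.ascii_uppercase


def guess_left_shift(ciphertext, assumed_top_letter="E"):
    """Guess the left shift by assuming the most common ciphertext letter stands for E."""
    counts = Counter(char for char in ciphertext.upper() if char in ALPHABET)
    if not counts:
        return 0
    top_cipher_letter = counts.most_common(1)[0][0]
    return (ALPHABET.index(assumed_top_letter) - ALPHABET.index(top_cipher_letter)) % 26


def undo_left_shift(ciphertext, shift):
    """Move each letter `shift` positions later in the alphabet, wrapping around."""
    return "".join(
        ALPHABET[(ALPHABET.index(char) + shift) % 26] if char in ALPHABET else char
        for char in ciphertext.upper()
    )


# Hypothetical test message, encrypted with a left shift of 3
message = "JBBQ JB EBOB TEBOB QEB BSBKFKD YOBBWB BKAP"
shift = guess_left_shift(message)
print(shift, undo_left_shift(message, shift))
```

If the decoded text is not readable, try again with `assumed_top_letter` set to another common letter such as `T` or `A`.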

```python
def break_cipher(cipher):
    # Rank the reference letters (from frequency_table above) from most to least frequent.
    # For the King James Bible counts this begins E, T, H, A, O, N, ...
    real = sorted(frequency_table, key=frequency_table.get, reverse=True)

    # Count each letter in the ciphertext, ignoring case
    frequency_table_scrambled = dict.fromkeys(frequency_table, 0)
    for char in cipher:
        letter = char.upper()
        if letter in frequency_table_scrambled:
            frequency_table_scrambled[letter] += 1

    # Rank the ciphertext letters the same way
    scrambled = sorted(frequency_table_scrambled, key=frequency_table_scrambled.get, reverse=True)

    # Pair letters of equal rank, then substitute in a single pass so that a
    # letter that has already been replaced is never replaced a second time
    mapping = dict(zip(scrambled, real))

    # Join the characters back into one string instead of returning a list
    return "".join(mapping.get(char, char) for char in cipher)


# Rank matching is only a first guess on a text this short;
# expect to correct some letters by hand
cipher = "PA PZ H WLYPVK VM JPCPS DHY. YLILS ZWHJLZOPWZ, ZAYPRPUN MYVT H OPKKLU IHZL, OHCL DVU AOLPY MPYZA CPJAVYF HNHPUZA AOL LCPS NHSHJAPJ LTWPYL. KBYPUN AOL IHAASL, YLILS ZWPLZ THUHNLK AV ZALHS ZLJYLA WSHUZ AV AOL LTWLYVY'Z BSAPTHAL DLHWVU, AOL KLHAO ZAHY, HU HYTVYLK ZWHJL ZAHAPVU DPAO LUVBNO WVDLY AV KLZAYVF HU LUAPYL WSHULA. WBYZBLK IF AOL LTWLYVY'Z ZPUPZALY HNLUAZ, WYPUJLZZ SLPH YHJLZ OVTL HIVHYK OLY ZAHYZOPW, JBZAVKPHU VM AOL ZAVSLU WSHUZ AOHA JHU ZHCL OLY WLVWSL HUK YLZAVYL MYLLKVT AV AOL NHSHEF ...."
decoded = break_cipher(cipher)
print(decoded)
```

## Part 2: Analyzing Server Activity

One important way for businesses to keep themselves secure is to monitor their server logs.

Read in `server_log.txt` containing server access logs with entries like "IP Address-Timestamp-Page Accessed". Notice which character we are using as a delimiter.

- Count the total number of unique IP addresses that accessed the server.
- Identify the top three most used IP addresses.
- Generate a report file `server_summary.txt` containing this information.
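
The parsing and counting steps can be sketched on a few in-memory lines before touching the real file. The sample lines below are hypothetical; the actual timestamp format in `server_log.txt` may differ. The sketch assumes the IP address is always the first field and that the delimiter may also appear later in the line, which is why it splits only once.

```python
from collections import Counter

# Hypothetical log lines; the real timestamp format in server_log.txt may differ
sample_lines = [
    "10.0.0.1-2024-01-05 09:15:00-/index.html",
    "10.0.0.2-2024-01-05 09:16:12-/login",
    "10.0.0.1-2024-01-05 09:17:45-/about",
    "",
]


def count_ips(lines, delimiter="-"):
    """Count how often each IP address (the first field) appears."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # maxsplit=1 keeps any later delimiters (e.g. inside a timestamp) out of the IP
        ip = line.split(delimiter, 1)[0]
        counts[ip] += 1
    return counts


counts = count_ips(sample_lines)
print(len(counts))            # number of unique IP addresses
print(counts.most_common(3))  # up to three (ip, count) pairs, most frequent first
```

`Counter.most_common(3)` returns fewer than three pairs when the log contains fewer than three distinct addresses, so the report code does not need a special case for small files.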

```python
with open("server_log.txt", "r") as logs:
    lines = logs.readlines()

# Count how many times each IP address appears; lines[1:] skips the first line of the file
ip_counts = {}
for line in lines[1:]:
    # The IP address is everything before the first "-" delimiter
    ip = line.split("-", 1)[0].strip()
    if not ip:
        continue  # skip blank lines
    ip_counts[ip] = ip_counts.get(ip, 0) + 1

# Sort by count, highest first, and keep the three most frequent addresses
top_three = sorted(ip_counts.items(), key=lambda item: item[1], reverse=True)[:3]

with open("server_summary.txt", "w") as summary:
    summary.write(f"Total number of unique IP addresses that accessed the server: {len(ip_counts)}\n\n")
    summary.write("Top three most common IP addresses:\n")
    for ip, occurrences in top_three:
        summary.write(f"{ip} with {occurrences}\n")
```

## Part 3: Creating Usernames

Use the file `emails.txt` to create a list of usernames and random passwords for each user. Then, output the emails, usernames, and random passwords into an output file `output.txt`.

Each username should be the part of the email address before the `@`. For `findlay_butler@hr.yahoo.com`, the username would be `findlay_butler`.

The passwords should be 8 characters long and a random combination of letters and numbers. 

For the first user, `output.txt` should look like: 
```
findlay_butler@hr.yahoo.com,findlay_butler,abiojash
```
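
One way to build a single output record is sketched below. The helper names `make_username` and `make_password` are hypothetical, and the sketch assumes that "letters and numbers" means ASCII letters and digits. It uses the standard-library `secrets` module for the random choices.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def make_username(email):
    """Return the part of the address before the first '@'."""
    return email.strip().split("@", 1)[0]


def make_password(length=8):
    """Return `length` random characters drawn from letters and digits."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


email = "findlay_butler@hr.yahoo.com"
print(f"{email},{make_username(email)},{make_password()}")
```

The password differs on every run, so only its length and character set can be checked, not its exact value.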

### Challenge: Using Regular Expressions

Instead of using the email username, build each username from the user's first initial and last name. For `findlay_butler@hr.yahoo.com`, the username would be `fbutler`. The easiest way to do this is probably **using regular expressions.**

For more explanation and practice with regular expressions, use [regexone.com](https://regexone.com/). For help creating your regular expression query, use [regex101.com](https://regex101.com/). 
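
As a starting point, the sketch below shows one pattern that handles the example address. It assumes every address looks like `first_last@domain`; the names `NAME_PATTERN` and `initial_and_last_name` are hypothetical, and addresses in `emails.txt` that follow a different shape would need a different pattern.

```python
import re

# first name, underscore, last name, then "@"; assumes addresses look like first_last@domain
NAME_PATTERN = re.compile(r"^([A-Za-z])[A-Za-z]*_([A-Za-z]+)@")


def initial_and_last_name(email):
    """Return first initial + last name, or None if the address does not fit the pattern."""
    match = NAME_PATTERN.match(email.strip())
    if match is None:
        return None
    return match.group(1) + match.group(2)


print(initial_and_last_name("findlay_butler@hr.yahoo.com"))  # fbutler
print(initial_and_last_name("admin@example.com"))            # None
```

Only the first letter of the first name and the whole last name are captured in groups; the rest of the first name is matched but discarded.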

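```python
import re
import secrets
import string

# Passwords draw from letters and digits only
ALPHABET = string.ascii_letters + string.digits

with open("emails.txt", "r") as file:
    emails = file.readlines()

lines = []
for email in emails:
    email = email.strip()
    # Group 1 is the first initial, group 2 the last name (first_last@domain)
    match = re.match(r"([a-zA-Z])[a-zA-Z]*_([a-zA-Z]+)@", email)
    if match:
        username = match.group(1) + match.group(2)
        # Exactly 8 random characters; token_urlsafe(8) would give a longer
        # string that may contain "-" or "_"
        password = "".join(secrets.choice(ALPHABET) for _ in range(8))
        lines.append((email, username, password))

# One comma-separated record per line: email,username,password
with open("output.txt", "w") as output:
    for email, username, password in lines:
        output.write(f"{email},{username},{password}\n")
```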