# M1.C4: Assignment: Python Coding \#1
Wyatt Blair

10/3/24

___

## Assignment

The goal of this assignment is to thoroughly tokenize the text (.txt file) Alice in Wonderland, which can be found in the 'Programming Assignment 1' module, and create two frequency dictionaries:

1. A Token Frequency Dictionary: Case-sensitive tokens based on specific tokenization rules.

2. A Full Word Frequency Dictionary: Complete, correctly spelled words.

These will be Python dictionaries. You must thoroughly comment on your code, line-by-line, explaining new functions and what they do to process text. You do NOT need to comment on well-known, trivial functions like the print statement. However, you MUST explain how each regex or tokenization function is changing the text.

Words in the 'Full Word Frequency Dictionary' must be spelled correctly. You will be graded at least partially on how correct and comprehensive this dictionary is.

## Output Requirements

1. Token Frequency Dictionary
* Data Structure: Python dictionary.
* Keys: Case-sensitive tokens (e.g., "Alice", "ALICE", "said").
* Values: Frequency of each token (e.g., the number of occurrences).
* Punctuation: Include punctuation as standalone tokens.
* Example: `{"Alice": 5, "said": 10, "ca": 3, "n't": 3, ",": 15}`

2. Full Word Frequency Dictionary
* Data Structure: Python dictionary.
* Keys: Complete words.
* Values: Frequency of each complete word.
* Example: `{"Alice": 5, "can't": 3, "believe": 2}`

## Instructions

1. Tokenization:
* Tokenize the text of Alice in Wonderland using industry-standard rules:
  * Split punctuation if it's not part of the word (e.g., commas, periods).
  * Keep the parts of a contraction as separate tokens (e.g., "can't" becomes "ca", "n't").
  * Maintain case sensitivity (e.g., "Alice" is distinct from "ALICE").
  * Split hyphenated words unless they are common expressions or proper nouns.
2. Dictionary Creation:
* Create a Token Dictionary that tracks the frequency of each token.
* Create a Full Word Dictionary that tracks the frequency of each
full word in the text (e.g., "can’t" remains one word in this dictionary).
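
The tokenization rules above can be illustrated with a single regular expression. This is only a minimal sketch, not the tokenizer used in the cells below: `TOKEN_RE` and `tokenize_sketch` are hypothetical names, the pattern assumes English contractions written with `'` or `’`, and it splits every hyphen instead of implementing the exception for common expressions and proper nouns.

```python
import re

# Alternatives are tried left to right at each position in the text.
TOKEN_RE = re.compile(
    r"""
      \w+(?=n['’]t\b)    # stem of an "n't" contraction: "ca" in "can't"
    | n['’]t\b           # the "n't" clitic itself
    | (?<=\w)['’]\w+     # other clitics directly after a word: "'s", "'ll", "'m"
    | \w+                # an ordinary word or number (case is preserved)
    | [^\w\s]            # any single punctuation character as its own token
    """,
    re.VERBOSE,
)


def tokenize_sketch(text: str) -> list[str]:
    """Return case-sensitive tokens with punctuation and clitics split off."""
    return TOKEN_RE.findall(text)


print(tokenize_sketch("Alice can't believe it's ALICE, said Alice."))
# ['Alice', 'ca', "n't", 'believe', 'it', "'s", 'ALICE', ',', 'said', 'Alice', '.']
```

The lookbehind in the third alternative is what keeps an opening quotation mark from being read as a clitic: in `'a book,'` both apostrophes come out as standalone punctuation tokens.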

## Submission Instructions

* Submit your assignment as a .py or .ipynb file.
* Upload your file to the Programming Assignment 1 section on the
course platform.
* I should be able to run your file by simply downloading it and running it in Google Colab.
* Include comments and explanations in your code of all unique regex, tokenization, and edit distance functions.
* If you are using external libraries, include the installation commands at the top of your file (e.g., `!pip install <package>`).

Due Date: October 16th, 2024
___

In [1]:
import string

In [2]:
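with open('../data/alice_in_wonderland.txt', 'r') as f:
    # read the whole book into a single string
    alice_text = f.read()
    # replace line breaks with spaces so words on adjacent lines stay separated
    alice_text = alice_text.replace('\n', ' ')
    
    # collapse runs of spaces until only single spaces remain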
    while '  ' in alice_text:
        alice_text = alice_text.replace('  ', ' ')
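
The newline replacement and the `while` loop above could also be written as one regex substitution. A minimal sketch, assuming tabs and other whitespace should be treated like spaces and that leading and trailing whitespace can be dropped (neither is done by the cell above); `normalize_whitespace` is a hypothetical helper name.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # \s+ matches one or more whitespace characters; each run becomes a single space
    return re.sub(r"\s+", " ", text).strip()


sample = "Down the\nRabbit-Hole\n\n  Alice   was beginning"
print(normalize_whitespace(sample))
# Down the Rabbit-Hole Alice was beginning
```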

In [3]:
def split_words(text):

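    # delete every ASCII punctuation character except the apostrophe and hyphen,
    # which are kept so tokenize() can split contractions and hyphenated words
    blacklist = string.punctuation.replace("'", "").replace("-", "")
    # str.maketrans('', '', chars) maps each character in chars to None (deletion)
    text = text.translate(str.maketrans('', '', blacklist))

    # split on single spaces and drop any empty strings left by repeated spaces
    words: list[str] = [word for word in text.split(' ') if word]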

    return words
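
To make the `translate` step concrete, here is a small standalone demonstration of the same deletion table on an invented sample sentence. Note that `string.punctuation` only contains ASCII characters, so typographic quotes such as `’` or `“` would pass through untouched; whether the source text contains any is an assumption worth checking.

```python
import string

# All ASCII punctuation except the apostrophe and the hyphen.
blacklist = string.punctuation.replace("'", "").replace("-", "")

# Two empty strings plus a third argument: every character in the third
# argument is mapped to None, which translate() treats as "delete".
table = str.maketrans("", "", blacklist)

sample = 'Oh dear! "I shall be late," thought Alice; (well-known, isn\'t it?)'
stripped = sample.translate(table)
print(stripped)
# Oh dear I shall be late thought Alice well-known isn't it
```

Because the punctuation is deleted rather than separated, it can never appear as a standalone token later on; a tokenizer that must count `","` or `"."` would need to keep these characters.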

In [4]:
def tokenize(text):

    tokens = []
    for word in split_words(text):

        new_tokens = []
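        # split on apostrophe
        if "'" in word:
            
            # position of the first apostrophe in the word
            punc_ind = word.index("'")
            
            # if the apostrophe is at the beginning or end of the word, treat it
            # as a quotation mark and remove every apostrophe from the word
            if punc_ind == 0 or punc_ind == len(word)-1:
                new_tokens.append(word.replace("'", ''))
            # otherwise take one character on each side of the apostrophe as its own
            # token: "don't" -> "do" + "n't", but also "Alice's" -> "Alic" + "e's"
            else:
                left_token   = word[:punc_ind-1            ]
                center_token = word[ punc_ind-1: punc_ind+2]
                right_token  = word[ punc_ind+2:           ]

                # keep only the non-empty pieces
                new_tokens.extend([token for token in [left_token, center_token, right_token] if token])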

        # split on hyphen
        elif "-" in word:
            
            punc_ind = word.index("-")
            left_word = word[:punc_ind]
            right_word = word[punc_ind+1:]

            # if the hyphen is at the beginning or end of the word
            if not left_word or not right_word:
                new_token = word.replace("-", "")
                new_tokens.append(new_token)

            # if both words are capitalized (proper noun)
            elif left_word[0].isupper() and right_word[0].isupper():
                new_tokens.append(word)
            
            # remove hyphen and add both words
            else:
                new_tokens.extend([left_word, right_word])
        
        # split a trailing "ing" into its own token (e.g. "sitting" -> "sitt" + "ing")
        elif word[-3:].lower() == "ing":

            left_token = word[:-3]
            right_token = word[-3:]

            new_tokens.append(left_token)
            new_tokens.append(right_token)

        # no punctuation-- add whole word as token
        else:
            new_tokens.append(word)
    
        tokens.extend(token for token in new_tokens if token)
    return tokens
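
The fixed one-character window around the apostrophe works for `n't` but cuts `Alice's` into `Alic` and `e's`, and `I'll` into `I'l` and `l`. One possible alternative, sketched here under the assumption that a short list of English clitics is enough, is to split on known suffixes. `CLITICS` and `split_contraction` are hypothetical names and are not part of the notebook.

```python
# Clitics to split off, compared case-insensitively against the end of the word.
CLITICS = ("n't", "'s", "'ll", "'re", "'ve", "'m", "'d")


def split_contraction(word: str) -> list[str]:
    """Split a trailing clitic off a word, e.g. "can't" -> ["ca", "n't"]."""
    for clitic in CLITICS:
        cut = len(word) - len(clitic)
        # require a non-empty stem so a bare "'s" is returned unchanged
        if cut > 0 and word[cut:].lower() == clitic:
            return [word[:cut], word[cut:]]
    # no known clitic: return the word as a single token
    return [word]


for example in ["can't", "Alice's", "I'll", "ALICE'S", "Alice"]:
    print(example, "->", split_contraction(example))
```

The original casing of both parts is preserved, so `ALICE'S` yields `ALICE` and `'S`.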


In [5]:
def token_frequency(tokens: list[str]) -> dict[str, int]:
    freq = {}
   
    for token in tokens:
        if token in freq:
            freq[token] += 1
        else:
            freq[token] = 1

    return freq
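
The counting loop in `token_frequency` can also be expressed with `collections.Counter` from the standard library. A brief sketch on a made-up token list, shown as an equivalent alternative and not as a required change:

```python
from collections import Counter

tokens = ["Alice", "said", ",", "Alice", "ALICE", ",", "Alice"]

# Counter is a dict subclass that tallies each hashable item it is given.
freq = Counter(tokens)

print(dict(freq))
# {'Alice': 3, 'said': 1, ',': 2, 'ALICE': 1}

# most_common(n) returns the n highest counts in descending order.
print(freq.most_common(2))
# [('Alice', 3), (',', 2)]
```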


In [6]:
def word_frequency(text: str) -> dict[str, int]:
    
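    # lowercase the text so "Alice" and "ALICE" are counted as the same word
    clean_text = text.lower()
    # reuse split_words to strip punctuation and split on spaces
    words = split_words(clean_text)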

    freq = {}
    for word in words:
        if word in freq:
            freq[word] += 1
        else:
            freq[word] = 1

    return freq
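
Because `split_words` keeps apostrophes, single quotation marks around dialogue stay attached to words, which is why entries such as `book'` and `conversation'` appear in the word frequencies further down. A minimal sketch of one way to trim them, assuming quotation marks only need removing at word edges; `strip_edge_quotes` is a hypothetical helper, and it would also remove the apostrophe from plural possessives such as `sisters'`.

```python
def strip_edge_quotes(word: str) -> str:
    """Remove quotation marks from both ends of a word, keeping inner apostrophes."""
    # str.strip with an argument removes any of these characters from both ends
    return word.strip("'\"‘’“”")


words = ["book'", "'and", "don't", "conversation'", "'"]

# strip each word, then drop anything that was only quotation marks
cleaned = [w for w in (strip_edge_quotes(word) for word in words) if w]
print(cleaned)
# ['book', 'and', "don't", 'conversation']
```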

In [7]:
token_freq = token_frequency(tokenize(alice_text))

In [8]:
word_freq = word_frequency(alice_text)

In [9]:
PRINT_LIM = 100
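
The print loops in this notebook test `i > PRINT_LIM` after printing, so each one lets `PRINT_LIM + 2` entries through before breaking. If exactly `PRINT_LIM` entries are wanted, `itertools.islice` is one way to cap the iteration. A small sketch with made-up data; `first_items` is a hypothetical helper and `demo_freq` stands in for the real dictionaries.

```python
from itertools import islice


def first_items(freq: dict[str, int], limit: int) -> list[tuple[str, int]]:
    """Return at most `limit` (key, value) pairs in the dictionary's own order."""
    # islice stops after `limit` items without building the full item list first
    return list(islice(freq.items(), limit))


demo_freq = {"the": 5, "Alice": 4, "said": 3, "to": 2, "a": 1}

for token, count in first_items(demo_freq, 3):
    print(f"-> {token}: {count}")
# -> the: 5
# -> Alice: 4
# -> said: 3
```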

In [10]:
print('TOKEN FREQUENCY:')
for i, (token, freq) in enumerate(token_freq.items()):
    print(f"-> {token}: {freq}")
    if i > PRINT_LIM: break

TOKEN FREQUENCY:
-> Alic: 10
-> e's: 52
-> Adventures: 3
-> in: 354
-> Wonderland: 3
-> ALIC: 3
-> E'S: 4
-> ADVENTURES: 1
-> IN: 2
-> WONDERLAND: 1
-> Lewis: 1
-> Carroll: 1
-> THE: 9
-> MILLENNIUM: 1
-> FULCRUM: 1
-> EDITION: 1
-> 30: 1
-> CHAPTER: 12
-> I: 424
-> Down: 3
-> the: 1534
-> Rabbit-Hole: 1
-> Alice: 386
-> was: 363
-> beginn: 11
-> ing: 851
-> to: 719
-> get: 44
-> very: 126
-> tired: 7
-> of: 494
-> sitt: 10
-> by: 54
-> her: 243
-> sister: 8
-> on: 190
-> bank: 3
-> and: 777
-> hav: 10
-> noth: 29
-> do: 123
-> once: 29
-> or: 75
-> twice: 4
-> she: 498
-> had: 184
-> peeped: 3
-> into: 67
-> book: 11
-> read: 13
-> but: 129
-> it: 483
-> no: 68
-> pictures: 4
-> conversations: 1
-> what: 90
-> is: 103
-> use: 18
-> a: 610
-> thought: 74
-> without: 26
-> conversation: 10
-> So: 27
-> consider: 4
-> own: 10
-> mind: 10
-> as: 246
-> well: 36
-> could: 82
-> for: 140
-> hot: 6
-> day: 26
-> made: 30
-> feel: 14
-> sleepy: 5
-> stupid: 5
-> whether: 11
-> pleasure: 2
-> 

In [11]:
print('TOKEN FREQUENCY (sorted):')
for i, (token, freq) in enumerate(sorted(token_freq.items(), key=lambda x: x[1], reverse=True)):
    print(f"-> {token}: {freq}")
    if i > PRINT_LIM: break

TOKEN FREQUENCY (sorted):
-> the: 1534
-> ing: 851
-> and: 777
-> to: 719
-> a: 610
-> she: 498
-> of: 494
-> it: 483
-> said: 453
-> I: 424
-> Alice: 386
-> was: 363
-> in: 354
-> you: 311
-> that: 259
-> as: 246
-> her: 243
-> n't: 216
-> at: 200
-> on: 190
-> had: 184
-> with: 175
-> all: 171
-> be: 164
-> for: 140
-> but: 129
-> not: 128
-> very: 126
-> so: 124
-> little: 124
-> do: 123
-> out: 116
-> this: 113
-> The: 110
-> they: 107
-> t's: 105
-> is: 103
-> down: 99
-> up: 98
-> he: 97
-> about: 94
-> his: 94
-> one: 92
-> what: 90
-> them: 87
-> were: 86
-> know: 85
-> like: 84
-> e: 84
-> went: 83
-> again: 83
-> herself: 83
-> could: 82
-> would: 82
-> have: 81
-> if: 77
-> or: 75
-> thought: 74
-> go: 74
-> did: 73
-> then: 71
-> when: 69
-> no: 68
-> time: 68
-> into: 67
-> see: 67
-> And: 67
-> Queen: 67
-> say: 64
-> off: 62
-> me: 61
-> K: 60
-> look: 58
-> began: 58
-> think: 57
-> I'm: 57
-> Turtle: 57
-> l: 56
-> its: 56
-> Mock: 56
-> my: 55
-> Gryphon: 55
-> by: 54

In [12]:
print('WORD FREQUENCY:')
for i, (word, freq) in enumerate(word_freq.items()):
    print(f"-> {word}: {freq}")
    if i > PRINT_LIM: break

WORD FREQUENCY:
-> alice's: 13
-> adventures: 6
-> in: 363
-> wonderland: 4
-> lewis: 1
-> carroll: 1
-> the: 1630
-> millennium: 1
-> fulcrum: 1
-> edition: 1
-> 30: 1
-> chapter: 12
-> i: 396
-> down: 97
-> rabbit-hole: 3
-> alice: 383
-> was: 352
-> beginning: 11
-> to: 715
-> get: 46
-> very: 143
-> tired: 7
-> of: 505
-> sitting: 10
-> by: 56
-> her: 247
-> sister: 8
-> on: 183
-> bank: 2
-> and: 843
-> having: 10
-> nothing: 32
-> do: 70
-> once: 29
-> or: 75
-> twice: 4
-> she: 536
-> had: 177
-> peeped: 3
-> into: 67
-> book: 5
-> reading: 3
-> but: 164
-> it: 478
-> no: 84
-> pictures: 4
-> conversations: 1
-> what: 131
-> is: 92
-> use: 18
-> a: 626
-> book': 2
-> thought: 74
-> without: 25
-> conversation': 1
-> so: 142
-> considering: 3
-> own: 10
-> mind: 7
-> as: 261
-> well: 53
-> could: 77
-> for: 148
-> hot: 5
-> day: 21
-> made: 30
-> feel: 8
-> sleepy: 5
-> stupid: 4
-> whether: 11
-> pleasure: 2
-> making: 8
-> daisy-chain: 1
-> would: 82
-> be: 139
-> worth: 4
-> t

In [13]:
print('WORD FREQUENCY (sorted):')
for i, (word, freq) in enumerate(sorted(word_freq.items(), key=lambda x: x[1], reverse=True)):
    print(f"-> {word}: {freq}")
    if i > PRINT_LIM: break

WORD FREQUENCY (sorted):
-> the: 1630
-> and: 843
-> to: 715
-> a: 626
-> she: 536
-> of: 505
-> it: 478
-> said: 457
-> i: 396
-> alice: 383
-> in: 363
-> was: 352
-> you: 340
-> as: 261
-> that: 249
-> her: 247
-> at: 210
-> on: 183
-> had: 177
-> with: 176
-> all: 171
-> but: 164
-> for: 148
-> very: 143
-> so: 142
-> be: 139
-> not: 132
-> what: 131
-> this: 129
-> they: 127
-> little: 125
-> he: 119
-> out: 113
-> down: 97
-> his: 96
-> one: 95
-> if: 94
-> up: 93
-> is: 92
-> about: 92
-> no: 84
-> went: 83
-> herself: 83
-> were: 83
-> would: 82
-> have: 80
-> when: 79
-> then: 79
-> like: 78
-> could: 77
-> or: 75
-> thought: 74
-> there: 73
-> them: 73
-> do: 70
-> off: 69
-> again: 68
-> into: 67
-> queen: 65
-> how: 64
-> see: 62
-> your: 62
-> know: 61
-> who: 61
-> time: 60
-> king: 60
-> don't: 59
-> did: 58
-> my: 58
-> its: 57
-> an: 57
-> began: 57
-> i'm: 57
-> by: 56
-> mock: 56
-> quite: 55
-> turtle: 55
-> gryphon: 55
-> me: 54
-> it's: 54
-> hatter: 54
-> well: 53