<div style="text-align: right">
    <i>
        Henry Pham 015500953<br>
        LING 165 Spring 2025 <br>
    </i>
</div>

NOTE: The initial setup part below is reused from Lab 5.

# Text Normalization for Speech or Chatbots Using Finite-State Transducers

## Preliminaries
Today we will use the [Pynini Python package](https://github.com/kylebgorman/pynini) for working with finite state transducers (Gorman 2016, Gorman & Sproat 2021).



In [29]:
# Check which version of Python we are running (Pynini needs at least Python 3.6)
import sys
print(sys.version)

3.11.12 (main, Apr  9 2025, 08:55:54) [GCC 11.4.0]


In [30]:
# You don't have to (and shouldn't have to) do this at home.
! pip install pynini



In [31]:
# You may want to do this once.
! pip install wurlitzer



In [32]:
# This is just a trick to make terminal error messages show up in the notebook.
# You don't need to use it unless you're using a notebook too.
# It won't work unless you've already installed wurlitzer as in the previous
# cell.
%load_ext wurlitzer

The wurlitzer extension is already loaded. To reload it, use:
  %reload_ext wurlitzer


In [33]:
import pynini as pn
import pandas as pd

# the rewrite library is gonna be handy
from pynini.lib import rewrite

In [34]:
# this is the gen-z dataset from Hugging Face I originally planned to use
df = pd.read_csv("hf://datasets/MLBtrio/genz-slang-dataset/all_slangs.csv")

In [35]:
# Due to the scale of the dataset, I will focus on 10 Gen-Z slang terms from https://keyhole.co/blog/top-genz-slangs/
# Ate, Bet, Bussin, Cheugy, Cringe, NPC, OOMF, Shook, Simp, TFW
#
# The candidate list below is longer (15 terms); isin() only keeps rows whose 'Slang' value matches a candidate exactly.
slang_words = ['Ate', 'Bet', 'BFR', 'Bussin', 'Cap', 'Cheugy', 'Cringe', 'FR', 'IYKYK', 'NPC', 'OOMF', 'Shook', 'Simp', 'TBF', 'TFW']
selected_slangs = df[df['Slang'].isin(slang_words)]

# The dataset lists Bussin as its 24th row, but df.iloc[23]['Slang'] returned Bet when I ran it.
# # grab the 24th Slang word from the dataset
# twenty_fourth_slang = df.iloc[23]['Slang']
# slang_words.append(twenty_fourth_slang)

# For some reason Bussin has a special character appended to it in the dataset, which is presumably why isin() misses it.
# It would probably be better to preprocess the dataset into all lowercase, but for now I will just manually add in the row.
bussin_row = {
    'Slang': 'Bussin',
    'Description': 'Used to say something is good. Primarily used to describe food',
    'Example': "This pizza is bussin’!",
    'Context': 'Primarily used to compliment delicious food, but can apply to anything impressive.'
}
selected_slangs = pd.concat([selected_slangs, pd.DataFrame([bussin_row])], ignore_index=True)

# sort by the slang word
selected_slangs = selected_slangs.sort_values(by='Slang')
# remove item at index 8 (duplicate of NPC)
selected_slangs = selected_slangs.drop(index=8)
# reset the index
selected_slangs = selected_slangs.reset_index(drop=True)

selected_slangs

|  | Slang | Description | Example | Context |
|---|---|---|---|---|
| 0 | Ate | If you “ate” something, it means you executed ... | She ate that performance, no crumbs left. | Popular in stan culture, especially in discuss... |
| 1 | Bet | Yes, ok, "it's on." | You want to meet at 6? Bet. | Used to confirm plans or agreements in a laid-... |
| 2 | Bussin | Used to say something is good. Primarily used ... | This pizza is bussin’! | Primarily used to compliment delicious food, b... |
| 3 | Cheugy | Derogatory term for Millennials. Used when mil... | That phrase is so cheugy, no one says that any... | Used to refer to things that were once popular... |
| 4 | Cringe | A response to embarrassment or social awkwardness | That performance was so cringe. | Often used to describe content, behavior, or s... |
| 5 | NPC | Someone who cannot think for themselves and/or... | She’s acting like such an NPC, just going with... | Used to describe people who seem to follow a s... |
| 6 | OOMF | One of my followers | OOMF just liked my tweet. | Refers to a follower on social media. |
| 7 | Shook | Surprised or shocked. | I was shook when I found out the news. | Often used to describe being emotionally or me... |
| 8 | Simp | Sycophancy, being overly affectionate in pursu... | He buys her gifts every day, he’s such a simp. | Typically used as a derogatory term for someon... |
| 9 | TFW | That feeling when | TFW you finish a big project and can finally r... | Often paired with an image or caption to conve... |
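
The stray character on Bussin, and the casing question raised in the comments above, could both be handled by normalizing the 'Slang' values before matching. A minimal sketch using only the standard library: `normalize_term` and the sample `raw_terms` list are hypothetical, and treating the stray character as a trailing curly apostrophe is an assumption based on the example sentence.

```python
import unicodedata

def normalize_term(term: str) -> str:
    """Lowercase a slang term and strip surrounding characters that are not letters or digits."""
    # NFKC folds compatibility forms (e.g. full-width letters) into their plain equivalents
    term = unicodedata.normalize("NFKC", term).strip()
    # Drop leading/trailing non-alphanumeric characters such as a curly apostrophe
    start, end = 0, len(term)
    while start < end and not term[start].isalnum():
        start += 1
    while end > start and not term[end - 1].isalnum():
        end -= 1
    return term[start:end].casefold()

# Hypothetical raw values standing in for df['Slang'] (assumption: Bussin carries a trailing ’)
raw_terms = ["Ate", "Bussin\u2019", " NPC ", "bet"]
wanted = {normalize_term(t) for t in ["Ate", "Bet", "Bussin", "NPC"]}

# Keep every raw value whose normalized form is one of the wanted terms
matched = [t for t in raw_terms if normalize_term(t) in wanted]
print(matched)
```

With pandas, the same function could be applied as `df['Slang'].map(normalize_term)` before calling `isin`, which would remove the need to add the Bussin row by hand.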


In [36]:
# shorthand description for each slang word to use in the FST
slang_descriptions = {
    'Ate': 'did well',
    'Bet': 'definitely',
    'Bussin': 'really good',
    'Cheugy': 'out of date',
    'Cringe': 'embarrassing',
    'NPC': "absent minded",
    'OOMF': "one of my followers",
    'Shook': "surprised",
    'Simp': "in love loser",
    'TFW': "that feeling when"
}

# create a new dataframe with the slang words and their descriptions
slang_df = pd.DataFrame(list(slang_descriptions.items()), columns=['Slang', 'Description'])
slang_df

|  | Slang | Description |
|---|---|---|
| 0 | Ate | did well |
| 1 | Bet | definitely |
| 2 | Bussin | really good |
| 3 | Cheugy | out of date |
| 4 | Cringe | embarrassing |
| 5 | NPC | absent minded |
| 6 | OOMF | one of my followers |
| 7 | Shook | surprised |
| 8 | Simp | in love loser |
| 9 | TFW | that feeling when |


In [38]:
from pynini.lib import rewrite

# For each slang word in slang_df, build a transducer that maps the slang word to its English meaning,
# then take the union of all of these string mappings.
slang_fst = pn.Fst()
for _, row in slang_df.iterrows():
    slang = row['Slang']
    desc = row['Description']
    pair_fst = pn.cross(slang, desc)
    if slang_fst.num_states() == 0:
        # the first pair becomes the starting FST
        slang_fst = pair_fst
    else:
        slang_fst |= pair_fst

# Example usage: convert a slang word to its meaning
def slang_to_english(word):
    try:
        return rewrite.one_top_rewrite(word, slang_fst)
    except rewrite.Error:
        # one_top_rewrite raises when slang_fst does not accept the word
        return word  # return original if not found

# Test
for slang in slang_df['Slang']:
    print(f"{slang} -> {slang_to_english(slang)}")

Ate -> did well
Bet -> definitely
Bussin -> really good
Cheugy -> out of date
Cringe -> embarrassing
NPC -> absent minded
OOMF -> one of my followers
Shook -> surprised
Simp -> in love loser
TFW -> that feeling when
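
Note that the mapping is case-sensitive: `Ate` is rewritten but `ate` is not, because the FST only accepts the exact strings it was built from. One way around this, sketched here in plain Python, is to expand each entry into a few casing variants before building the FST. `expand_case_variants` is a hypothetical helper, and the pairs it returns would still have to be turned into transducers (for example with `pn.cross` in a loop like the one above).

```python
def expand_case_variants(mapping):
    """Return (input, output) pairs covering common casings of each slang term."""
    pairs = []
    seen = set()
    for slang, meaning in mapping.items():
        # original, lowercase, uppercase, and Capitalized forms
        for variant in (slang, slang.lower(), slang.upper(), slang.capitalize()):
            if variant not in seen:  # skip duplicates such as 'NPC'.upper()
                seen.add(variant)
                pairs.append((variant, meaning))
    return pairs

# Small sample of the slang table
sample = {"Ate": "did well", "NPC": "absent minded"}
for variant, meaning in expand_case_variants(sample):
    print(f"{variant} -> {meaning}")
```

The trade-off is a larger FST in exchange for not having to lowercase the input sentence, which would lose the casing of the non-slang words.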


In [39]:
# Replace slang words in a sentence with their English meanings, one token at a time

# Tokenize the sentence on whitespace, replace slang, and join back
def sentence_slang_to_english(sentence):
    tokens = sentence.split()
    translated_tokens = [slang_to_english(token) for token in tokens]
    return ' '.join(translated_tokens)

# Example usage
example_sentence = "She Ate that pizza and it was Bussin"
print(sentence_slang_to_english(example_sentence))

She did well that pizza and it was really good


In [40]:
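# INITIAL ATTEMPT
# This fails: rewrite.one_top_rewrite raises rewrite.Error ("Composition failure") for any token that
# slang_fst does not accept, so the `if ... else token` fallback below is never reached.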
word = pn.closure(pn.union(*"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'"), 1)
space = pn.accep(" ")
sentence_fst = word + pn.closure(space + word)

# compose with slang_fst to convert slang to English
def sentence_to_english_fst(sentence):
    tokens = sentence.split()
    translated_tokens = [rewrite.one_top_rewrite(token, slang_fst) if rewrite.one_top_rewrite(token, slang_fst) else token for token in tokens]
    return ' '.join(translated_tokens)

# test
test_sentence = "She Ate that pizza and it was Bussin"
converted = sentence_to_english_fst(test_sentence)
print(f"Converted: '{converted}'")

Error: Composition failure

In [46]:
# accepts any sequence of letters (upper/lowercase) and spaces, at least one word
word = pn.closure(pn.union(*"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'"), 1)
space = pn.accep(" ")
sentence_fst = word + pn.closure(space + word)

# translate token by token with slang_fst to convert slang to English
def sentence_to_english_fst(sentence):
    # tokenize sentence
    tokens = sentence.split()
    # using tokenization instead of composition because composing the whole sentence with slang_fst did not work.
    translated_tokens = []
    for token in tokens:
        try:
            translated_tokens.append(rewrite.one_top_rewrite(token, slang_fst))
        except rewrite.Error:
            # one_top_rewrite raises when slang_fst does not accept the token, so keep it unchanged
            translated_tokens.append(token)
    return ' '.join(translated_tokens)

# test
test_sentence = "She Ate that pizza and it was Bussin"
converted = sentence_to_english_fst(test_sentence)
print(f"Converted: '{converted}'")

## AFTER TESTING: THIS DID NOT ACCOUNT FOR PUNCTUATION ATTACHED TO SLANG WORDS (e.g. "Bussin!" IS ONE TOKEN AND IS NOT TRANSLATED)

Converted: 'She did well that pizza and it was really good'


In [52]:
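# FINAL ATTEMPT, THIS ACCOUNTS FOR PUNCTUATION
# The idea: split the sentence into words and punctuation marks, translate each word token, and then rebuild the sentence, adding the spaces back around the words.
# Known limitation: the rebuild step also puts a space before punctuation that follows a plain word (e.g. "pizza ,"), as the outputs below show.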

import re

def sentence_to_english_fst(sentence):
    tokens = re.findall(r"\w+|[^\w\s]", sentence, re.UNICODE)
    translated_tokens = []
    for token in tokens:
        if token.isalpha():
            try:
                translated = rewrite.one_top_rewrite(token, slang_fst)
                translated_tokens.append(translated)
            except pn.lib.rewrite.Error:
                translated_tokens.append(token)
        else:
            translated_tokens.append(token)
    result = ""
    for i, token in enumerate(translated_tokens):
        if i > 0 and (token.isalnum() or translated_tokens[i-1].isalnum()):
            result += " "
        result += token
    return result

# test
test_sentence = "She Ate that pizza, and it was Bussin!"
converted = sentence_to_english_fst(test_sentence)
print(f"Converted: '{converted}'")

Converted: 'She did well that pizza , and it was really good!'
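
The extra space before the comma comes from the rebuild step. An alternative, sketched below under the assumption that a plain dictionary lookup stands in for `rewrite.one_top_rewrite`, is to leave the sentence intact and substitute only the word matches with `re.sub`, so punctuation and spacing are never touched. `SLANG` and `translate_in_place` are hypothetical names.

```python
import re

# Stand-in for the FST lookup: a subset of the slang table above
SLANG = {"Ate": "did well", "Bussin": "really good", "Cheugy": "out of date"}

def translate_in_place(sentence: str) -> str:
    """Replace slang words where they stand, keeping punctuation and spacing as-is."""
    def lookup(match: re.Match) -> str:
        word = match.group(0)
        return SLANG.get(word, word)  # unknown words pass through unchanged
    # Only runs of ASCII letters are looked up; everything else is copied verbatim
    return re.sub(r"[A-Za-z]+", lookup, sentence)

print(translate_in_place("She Ate that pizza, and it was Bussin!"))
print(translate_in_place("That's so Cheugy."))
```

In the notebook, the dictionary lookup inside `lookup` could be swapped for the `try`/`except` call to `rewrite.one_top_rewrite` without changing the rest of the function.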


In [None]:
# example sentences
sentence1 = "That food is Bussin"
sentence2 = "That show was so Cheugy"
sentence3 = "He's such a Simp"
sentence4 = "This sentence has no slang"
sentence5 = "This food is Bussin FR"
sentence6 = "That show was Cheugy and TFW you see it"

In [48]:
example_sentences = [
    "She Ate that performance, it was amazing!",
    "Bet, I can finish this assignment by tomorrow.",
    "This burger is absolutely Bussin, you have to try it!",
    "Wearing skinny jeans and a side part? That's so Cheugy.",
    "His awkward attempt at a joke was pure Cringe.",
    "Sometimes I feel like an NPC just going through the motions.",
    "Shoutout to OOMF who shared that funny meme.",
    "I was completely Shook when I heard the news.",
    "He's being such a Simp for her, it's kinda sad.",
    "TFW you finally get to relax after a long week."
]

In [51]:
print("\nTesting example sentences:")
for sentence in example_sentences:
    print(f"Original: '{sentence}'")
    converted = sentence_to_english_fst(sentence)
    print(f"Converted: '{converted}'")
    print("-" * 20)


Testing example sentences:
Original: 'She Ate that performance, it was amazing!'
Converted: 'She did well that performance , it was amazing !'
--------------------
Original: 'Bet, I can finish this assignment by tomorrow.'
Converted: 'definitely , I can finish this assignment by tomorrow .'
--------------------
Original: 'This burger is absolutely Bussin, you have to try it!'
Converted: 'This burger is absolutely really good, you have to try it !'
--------------------
Original: 'Wearing skinny jeans and a side part? That's so Cheugy.'
Converted: 'Wearing skinny jeans and a side part ? That ' s so out of date.'
--------------------
Original: 'His awkward attempt at a joke was pure Cringe.'
Converted: 'His awkward attempt at a joke was pure embarrassing .'
--------------------
Original: 'Sometimes I feel like an NPC just going through the motions.'
Converted: 'Sometimes I feel like an absent minded just going through the motions .'
--------------------
Original: 'Shoutout to OOMF who sh

----

In [None]:
# Okay, so I thought I pushed my code from last night, but in fact I did not. I will now be shifting gears towards a YouTube video that I enjoyed.
# The video is a speech to an audience of high school students delivered in Gen-Z slang. I chose this speech not only because it was fun
# to watch but also because it comes with a direct translation into Standard English.
# NOTE: for the sake of scale, I will only be using 1 or 2 sentences, but the entire speech and its translation are located in Xiaomanyc.md

# Gen-Alpha Slang  
No cap, speaking another language let's you go off. Turning you into an absolute conversational Rizzlord. So yeah chat, that's the sauce. Keep cooking, stay goated, never be mid, and FR study hard and go rizz up that knowledge.

# STANDARD ENGLISH  
Honestly, speaking another language lets you truly shine. Turning you into an expert communicator. So yes, everyone. That's the main idea. Keep growing, stay exceptional, never settle for average and truly study hard and master that knowledge.

## Main Gen-Z slang terms / phrases used -> translation  
- No cap -> Honestly
- go off -> truly shine
- Rizzlord -> expert
- Chat -> everyone
- sauce -> main idea
- cooking -> growing
- goated / stay goated -> exceptional
- mid -> average / settle for average
- FR -> truly
- rizz / rizz up -> master

## How do these compare to the dataset from Hugging Face?

- No cap -> Honestly  
    - Cap: a lie or exaggeration
- go off -> truly shine
    - not in dataset
- Rizzlord -> expert
    - Rizz: One's courtship/seduction skills
- Chat -> everyone  
    - not in dataset
- sauce -> main idea
    - not in dataset
- cooking -> growing
    - not in dataset
- goated / stay goated -> exceptional
    - GOAT: greatest of all time
- mid -> average / settle for average
    - Mid: short for mediocre
- FR -> truly
    - not in dataset
- rizz / rizz up -> master
    - Rizz: One's courtship/seduction skills

### Observations / takeaways from the comparison
To start, only 5 of the 10 slang terms had a related entry in the dataset, even after I tried every variation of each term that I could think of that might carry the same meaning. One likely reason is that the YouTube speech by Xiao uses "Gen-Alpha" terms whereas the dataset covers "Gen-Z" slang, so the two vocabularies may simply differ. This goes against my general assumption that Gen-Z and Gen-Alpha slang are one and the same, or at least overlap heavily. Since this finding breaks one of my initial assumptions, a good next step would be to find a new dataset that defines the terms used in the Gen-Alpha speech, but since I am short on time and brain power I will instead include this idea in a "Future Works" section.
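
The count in the paragraph above can be reproduced from the comparison list. A small sketch, where the dictionary is transcribed by hand from that list and `speech_to_dataset` is a hypothetical name; note that two speech terms map to the same Rizz entry, so the number of distinct dataset entries is lower than the number of covered terms.

```python
# Speech term -> dataset entry it was matched to (None when nothing was found),
# copied by hand from the comparison list above
speech_to_dataset = {
    "No cap": "Cap",
    "go off": None,
    "Rizzlord": "Rizz",
    "Chat": None,
    "sauce": None,
    "cooking": None,
    "goated": "GOAT",
    "mid": "Mid",
    "FR": None,
    "rizz up": "Rizz",
}

covered = [term for term, entry in speech_to_dataset.items() if entry is not None]
missing = [term for term, entry in speech_to_dataset.items() if entry is None]
distinct_entries = {entry for entry in speech_to_dataset.values() if entry is not None}

print(f"covered: {len(covered)} of {len(speech_to_dataset)}")
print(f"distinct dataset entries used: {len(distinct_entries)}")
print(f"missing: {missing}")
```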

In [54]:
# Gen-Alpha speech examples
gen_alpha_slang_descriptions = {
    'No cap': 'Honestly',
    'go off': 'truly shine',
    'Rizzlord': 'expert',
    'Chat': 'everyone',
    'sauce': 'main idea',
    'cooking': 'growing',
    'goated': 'exceptional',
    'mid': 'average',
    'FR': 'truly',
    'rizz up': 'master'
}

# dataframe for the Gen-Alpha slang descriptions
gen_alpha_slang_df = pd.DataFrame(list(gen_alpha_slang_descriptions.items()), columns=['Slang', 'Description'])
print("Gen-Alpha Slang DataFrame:")
print(gen_alpha_slang_df)

# creating the fst by iterating over the dataframe
gen_alpha_slang_fst = pn.Fst()
for _, row in gen_alpha_slang_df.iterrows():
    slang = row['Slang']
    desc = row['Description']
    # Handle multi-word slang by treating the phrase as a single unit
    pair_fst = pn.cross(slang, desc)
    if gen_alpha_slang_fst.num_states() == 0:
        gen_alpha_slang_fst = pair_fst
    else:
        gen_alpha_slang_fst |= pair_fst


gen_alpha_slang_fst.optimize()

def transform_sentence_with_gen_alpha_fst(sentence):
    # NOTE: despite the name, this uses plain str.replace rather than gen_alpha_slang_fst.
    # str.replace matches substrings, so a short term like "mid" would also be replaced inside a longer word.
    transformed_sentence = sentence
    # Iterate through the slang terms, longest first, so longer phrases are replaced before shorter ones
    for _, row in gen_alpha_slang_df.sort_values(by='Slang', key=lambda x: x.str.len(), ascending=False).iterrows():
        slang = row['Slang']
        desc = row['Description']
        transformed_sentence = transformed_sentence.replace(slang, desc)

    return transformed_sentence

# test
gen_alpha_sentence = "No cap, speaking another language let's you go off. Turning you into an absolute conversational Rizzlord. So yeah Chat, that's the sauce. Keep cooking, stay goated, never be mid, and FR study hard and go rizz up that knowledge."
gen_alpha_expected = "Honestly, speaking another language let's you truly shine. Turning you into an absolute conversational expert. So yeah everyone, that's the main idea. Keep growing, stay exceptional, never be average, and truly study hard and go master that knowledge."

print("\nTesting Gen-Alpha Sentence Transformation:")
print(f"Original: {gen_alpha_sentence}")
transformed_gen_alpha = transform_sentence_with_gen_alpha_fst(gen_alpha_sentence)
print(f"Transformed: {transformed_gen_alpha}")
print(f"Expected:    {gen_alpha_expected}")

Gen-Alpha Slang DataFrame:
      Slang  Description
0    No cap     Honestly
1    go off  truly shine
2  Rizzlord       expert
3      Chat     everyone
4     sauce    main idea
5   cooking      growing
6    goated  exceptional
7       mid      average
8        FR        truly
9   rizz up       master

Testing Gen-Alpha Sentence Transformation:
Original: No cap, speaking another language let's you go off. Turning you into an absolute conversational Rizzlord. So yeah Chat, that's the sauce. Keep cooking, stay goated, never be mid, and FR study hard and go rizz up that knowledge.
Transformed: Honestly, speaking another language let's you truly shine. Turning you into an absolute conversational expert. So yeah everyone, that's the main idea. Keep growing, stay exceptional, never be average, and truly study hard and go master that knowledge.
Expected:    Honestly, speaking another language let's you truly shine. Turning you into an absolute conversational expert. So yeah everyone, that's th
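
Because `str.replace` works on substrings, a term such as `mid` would also be replaced inside a word like "middle". A possible refinement, sketched here with the standard `re` module rather than Pynini, is to compile all terms into one pattern guarded by word boundaries. `GEN_ALPHA` is a subset of the mapping above, and `replace_whole_terms` is a hypothetical name.

```python
import re

# Subset of the Gen-Alpha mapping above
GEN_ALPHA = {
    "No cap": "Honestly",
    "go off": "truly shine",
    "mid": "average",
    "rizz up": "master",
}

# Longest phrases first so that a longer phrase wins over any shorter one it contains
_terms = sorted(GEN_ALPHA, key=len, reverse=True)
# \b on both sides restricts matches to whole words or whole phrases
_pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in _terms) + r")\b")

def replace_whole_terms(sentence: str) -> str:
    """Replace slang terms only when they appear as whole words or phrases."""
    return _pattern.sub(lambda m: GEN_ALPHA[m.group(0)], sentence)

print(replace_whole_terms("No cap, never be mid in the middle of a test."))
```

The match is still case-sensitive, as in the original function, so `Chat` and `chat` would need separate entries.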