<a href="https://colab.research.google.com/github/bu-cds-llms/portfolio-piece-1-albhoe/blob/submissionbranch/lab_1_revision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Movie Review Sentiment Analysis

- Bag of Words - Convert text to numbers
- TF-IDF - Weight words by importance  
- N-grams - Capture phrases
- **Build a Real Classifier** - Train and test!

**Dataset:** IMDB Movie Reviews (a 5,000-review subset: 2,500 positive and 2,500 negative reviews from the training split)



## Setup

In [None]:
# Libraries
import numpy as np
import pandas as pd
from collections import Counter
from typing import List, Dict
import re

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print('Libraries loaded')

Libraries loaded


## Load IMDB Dataset

In [None]:
print('Downloading IMDB dataset directly...\n')

import urllib.request
import tarfile
import os

# Load reviews from files
def load_imdb_data(path, num_samples=2500):
    """Load up to num_samples positive and num_samples negative reviews from the train split."""
    reviews = []

    for folder, sentiment in [('pos', 'positive'), ('neg', 'negative')]:
        folder_path = os.path.join(path, 'train', folder)
        # Sort the listing so the same files are selected on every machine
        # (os.listdir returns entries in arbitrary order)
        for filename in sorted(os.listdir(folder_path))[:num_samples]:
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as f:
                reviews.append({'review': f.read(), 'sentiment': sentiment})

    return reviews

# Download if not present
if not os.path.exists('aclImdb'):
    print('Downloading IMDB dataset (this may take a minute)...')
    url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
    urllib.request.urlretrieve(url, 'aclImdb_v1.tar.gz')

    # Extract
    with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as tar:
        tar.extractall()
    print('✓ Downloaded!')


reviews_data = load_imdb_data('aclImdb', num_samples=2500)
data = pd.DataFrame(reviews_data)
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

print(f' Loaded {len(data)} reviews')

Downloading IMDB dataset directly...

 Loaded 5000 reviews


## Exercise 1: Bag of Words

### 1.1 Preprocessing

In [None]:
def preprocess_text(text: str) -> List[str]:
    """
    Clean and tokenize text. Converts strings containing html into list of token strings.
    """
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', text)

    # Convert to lowercase and split
    tokens = text.lower().split()

    # Remove punctuation
    cleaned = []
    for token in tokens:
        clean = re.sub(r'[^\w\s]', '', token)
        if clean:
            cleaned.append(clean)

    return cleaned
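
A quick way to see what the tokenizer does is to run it on a short string. The sketch below repeats the same steps in compact form so that it runs on its own; the sample sentence is made up for illustration. Note that stripping punctuation merges contractions and hyphenated words into single tokens.

```python
import re
from typing import List

def preprocess_text(text: str) -> List[str]:
    # Same steps as the cell above, repeated so this snippet is self-contained
    text = re.sub(r'<[^>]+>', ' ', text)        # drop HTML tags such as <br />
    tokens = text.lower().split()               # lowercase, split on whitespace
    cleaned = [re.sub(r'[^\w\s]', '', t) for t in tokens]  # strip punctuation
    return [t for t in cleaned if t]            # discard tokens that became empty

# Hypothetical sample review, not taken from the dataset
sample = "Great film!<br /><br />I didn't expect a well-made sequel."
tokens = preprocess_text(sample)
print(tokens)
```

Here `didn't` becomes `didnt` and `well-made` becomes `wellmade`, which is worth keeping in mind when reading the vocabulary.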

### 1.2 Build Vocabulary

In [None]:
def build_vocabulary(documents: List[str], drop_features: int = 0, max_features: int = 2000) -> Dict[str, int]:
    """
    Build vocabulary from documents. Converts a list of strings (documents) into a dictionary containing max_features most common entries.
    The dictionary holds the count of the times that token appears in the corpus. The dictionary ignores the drop_features most common entries.
    """
    # Iterate through all tokens in all documents. Run counters for every distinct token.
    vocabulary = Counter()
    for doc in documents:
        tokens = preprocess_text(doc)
        for token in tokens:
            vocabulary[token] += 1
    # Skip the drop_features most common tokens, then keep the next max_features entries.
    ranked = vocabulary.most_common(max_features + drop_features)
    return dict(ranked[drop_features:])

# Test
print('Building vocabulary...')
vocab = build_vocabulary(data['review'].tolist(), drop_features = 90, max_features=8000)
print(f'Vocabulary size: {len(vocab)}')
print('Sample vocabulary items:', list(vocab.items())[:10])



Building vocabulary...
Vocabulary size: 8000
Sample vocabulary items: [('then', 1601), ('make', 1596), ('movies', 1594), ('films', 1585), ('any', 1543), ('way', 1529), ('after', 1495), ('characters', 1488), ('could', 1483), ('too', 1465)]
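
The effect of `drop_features` is easiest to see on a tiny counter. The sketch below uses invented counts (not figures from the dataset) and the same `most_common` slicing idea: the most frequent tokens, which tend to be stop words, are skipped and the next most frequent ones are kept.

```python
from collections import Counter

# Hypothetical token counts for illustration only
counts = Counter({'the': 5, 'and': 4, 'movie': 3, 'great': 2, 'boring': 1})

drop_features = 2   # skip the two most common tokens
max_features = 2    # then keep the next two

ranked = counts.most_common(max_features + drop_features)
kept = dict(ranked[drop_features:])
print(kept)
```

With these numbers `the` and `and` are dropped, and `movie` and `great` form the vocabulary.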


### 1.3 Vectorize Documents

In [None]:
def vectorize_document(document: str, vocabulary: Dict[str, int]) -> np.ndarray:
    """
    Convert document to vector
    """
    # Initialize vector
    vector = np.zeros(len(vocabulary))

    # Map each vocabulary word to its column index (dicts preserve insertion order)
    word_to_index = {word: i for i, word in enumerate(vocabulary)}

    # Count how many times each vocabulary word appears in the document
    tokens = preprocess_text(document)
    word_counts = Counter(token for token in tokens if token in word_to_index)

    # Each vector component holds the count of the corresponding vocabulary word
    for word, count in word_counts.items():
        vector[word_to_index[word]] = count

    return vector

# Vectorize all reviews in corpus
print('Vectorizing reviews...')
X_bow = np.array([vectorize_document(r, vocab) for r in data['review']])
print(f'✓ Matrix shape: {X_bow.shape}')
print('Sample vector:', X_bow[0][:10])

Vectorizing reviews...
✓ Matrix shape: (5000, 8000)
Sample vector: [1. 0. 0. 1. 0. 2. 1. 0. 1. 2.]


## Exercise 2: TF-IDF
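
As a reference for this exercise, the quantities computed below can be written out as follows. This is a sketch that assumes the standard length-normalised term frequency and the unsmoothed natural-log IDF used in `calculate_idf`; the symbols $f_{t,d}$, $N$ and $n_t$ are notation introduced here, not names from the code.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
$$

Here $f_{t,d}$ is the number of times token $t$ occurs in document $d$, $N$ is the number of documents, and $n_t$ is the number of documents containing $t$. For example, with $N = 5000$, a token that appears in 500 documents gets $\mathrm{idf} = \ln(10) \approx 2.30$, while a token that appears in every document gets $\ln(1) = 0$.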

### 2.1 Calculate TF

In [None]:
def calculate_tf(document: str) -> Dict[str, float]:
    """
    Calculate term frequency
    """
    tokens = preprocess_text(document)
    total = len(tokens)
    if total == 0:
        return {}

    # Calculate TF for each word: number of occurrences divided by total token count
    tf_scores = {word: count / total for word, count in Counter(tokens).items()}

    return tf_scores

### 2.2 Calculate IDF

In [None]:
def calculate_idf(documents: List[str], vocabulary: Dict[str, int]) -> Dict[str, float]:
    """
    Calculate inverse document frequency
    """
    total_docs = len(documents)
    doc_count = {word: 0 for word in vocabulary.keys()}

    # Count documents containing each word
    for doc in documents:
        unique_words = set(preprocess_text(doc))
        for word in unique_words:
            if word in doc_count:
                doc_count[word] = doc_count[word] + 1

    # Calculate Inverse Document Frequency of each word
    idf_scores = {}
    for word, count in doc_count.items():
        idf_scores[word] = np.log((total_docs)/(count))

    return idf_scores

print('Calculating IDF...')
idf_scores = calculate_idf(data['review'].tolist(), vocab)
print(' IDF calculated!')
print('Sample IDF scores:', list(idf_scores.items())[:10])

Calculating IDF...
 IDF calculated!
Sample IDF scores: [('then', np.float64(1.4881056272665754)), ('make', np.float64(1.4130490984287103)), ('movies', np.float64(1.5186835491656363)), ('films', np.float64(1.5925807953676774)), ('any', np.float64(1.4731602941415525)), ('way', np.float64(1.4550015591296814)), ('after', np.float64(1.510497964579197)), ('characters', np.float64(1.5521128458148308)), ('could', np.float64(1.484568930388231)), ('too', np.float64(1.5474025215146476))]


### 2.3 Calculate TF-IDF Vectors

In [None]:
def calculate_tfidf_vector(document: str, vocabulary: Dict[str, int], idf_scores: Dict[str, float]) -> np.ndarray:
    """
    Calculate TF-IDF vector
    """
    vector = np.zeros(len(vocabulary))
    tf_scores = calculate_tf(document)

    # Map each vocabulary word to its column index (dicts preserve insertion order)
    word_to_index = {word: i for i, word in enumerate(vocabulary)}

    # TF-IDF = TF × IDF
    for word, tf in tf_scores.items():
        idx = word_to_index.get(word)
        if idx is not None:
            vector[idx] = tf * idf_scores.get(word, 0)

    return vector

print('Creating TF-IDF vectors...')
X_tfidf = np.array([calculate_tfidf_vector(r, vocab, idf_scores) for r in data['review']])
print(f'TF-IDF matrix shape: {X_tfidf.shape}')

Creating TF-IDF vectors...
TF-IDF matrix shape: (5000, 8000)


## Exercise 3: N-grams

In [None]:
def extract_ngrams(text: str, n: int) -> List[str]:
    """
    Convert text into a list of n-gram strings, which is every subsequence of n token length contained in the text.
    """
    tokens = preprocess_text(text)
    ngrams = []

    # Extract n-grams
    for i in range(len(tokens) - n + 1):
        ngrams.append(' '.join(tokens[i:i+n]))

    return ngrams

# Test
print('Bigrams:', extract_ngrams('This movie was not good', 2))

Bigrams: ['this movie', 'movie was', 'was not', 'not good']
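
Bigrams matter for sentiment because they keep negations such as "not good" together. The sketch below counts bigrams across a few invented sentences; `word_ngrams` is a hypothetical helper that works on pre-split tokens, written so the snippet runs on its own. It is an illustration, not part of the classifier trained in this notebook.

```python
from collections import Counter
from typing import List

def word_ngrams(tokens: List[str], n: int) -> List[str]:
    # Every run of n consecutive tokens, joined with a space
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical mini-corpus for illustration only
reviews = [
    'this movie was not good',
    'this movie was good',
    'not good at all',
]

bigram_counts = Counter()
for review in reviews:
    bigram_counts.update(word_ngrams(review.split(), 2))

print(bigram_counts['not good'], bigram_counts['was good'])
```

A unigram model sees the word `good` in all three sentences, while the bigram counts separate `not good` from `was good`.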


## Exercise 4: Build Classifier

### 4.1 Prepare Data

In [None]:
# Convert labels to numbers
y = (data['sentiment'] == 'positive').astype(int).values

# Split dataset
X_bow_train, X_bow_test, y_train, y_test = train_test_split(
    X_bow, y, test_size=0.2, random_state=42, stratify=y
)

X_tfidf_train, X_tfidf_test, _, _ = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42, stratify=y
)

print(f'Training: {len(X_bow_train)}, Test: {len(X_bow_test)}')

Training: 4000, Test: 1000


### 4.2 Train with Bag of Words

In [None]:
# Train Bag of Words model
nb_bow = MultinomialNB()
nb_bow.fit(X_bow_train, y_train)

# Evaluate
y_pred_bow = nb_bow.predict(X_bow_test)
accuracy_bow = accuracy_score(y_test, y_pred_bow)

print(f'BoW Accuracy: {accuracy_bow:.4f} ({accuracy_bow*100:.1f}%)')
print('\n', classification_report(y_test, y_pred_bow, target_names=['Negative', 'Positive']))

BoW Accuracy: 0.8650 (86.5%)

               precision    recall  f1-score   support

    Negative       0.86      0.87      0.87       500
    Positive       0.87      0.86      0.86       500

    accuracy                           0.86      1000
   macro avg       0.87      0.86      0.86      1000
weighted avg       0.87      0.86      0.86      1000



### 4.3 Train with TF-IDF

In [None]:
# Train TF-IDF model
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_tfidf_train, y_train)

# Evaluate
y_pred_tfidf = nb_tfidf.predict(X_tfidf_test)
accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf)

print(f'TF-IDF Accuracy: {accuracy_tfidf:.4f} ({accuracy_tfidf*100:.1f}%)')
print('\n', classification_report(y_test, y_pred_tfidf, target_names=['Negative', 'Positive']))

TF-IDF Accuracy: 0.8820 (88.2%)

               precision    recall  f1-score   support

    Negative       0.86      0.91      0.89       500
    Positive       0.91      0.85      0.88       500

    accuracy                           0.88      1000
   macro avg       0.88      0.88      0.88      1000
weighted avg       0.88      0.88      0.88      1000
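
The precision, recall and F1 columns in the reports above are all derived from four confusion-matrix counts. The sketch below works through that arithmetic in plain Python on a small set of invented labels (they are not the notebook's predictions), treating class 1 as positive.

```python
from collections import Counter

# Hypothetical labels for illustration only (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# Count each (true, predicted) combination
pairs = Counter(zip(y_true, y_pred))
tp, fn = pairs[(1, 1)], pairs[(1, 0)]
fp, tn = pairs[(0, 1)], pairs[(0, 0)]

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)

print(f'TP={tp} FN={fn} FP={fp} TN={tn}')
print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}')
```

The `confusion_matrix` function imported in Setup returns the same four counts as a 2×2 array, with true labels on the rows and predicted labels on the columns.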



### 4.4 Test on Your Reviews!

In [None]:
def predict_sentiment(review: str, model, use_tfidf=True):
    if use_tfidf:
        vector = calculate_tfidf_vector(review, vocab, idf_scores)
    else:
        vector = vectorize_document(review, vocab)
    vector = vector.reshape(1, -1)
    pred = model.predict(vector)[0]
    prob = model.predict_proba(vector)[0]
    return pred, prob

# Test reviews
test_reviews = [
    'This movie was absolutely incredible! Best film ever!',
    'Terrible waste of time. Boring and poorly made.',
    'Masterpiece! Brilliant acting and story!'
]

print('🎬 Testing Reviews:\n')
for review in test_reviews:
    pred, prob = predict_sentiment(review, nb_tfidf)
    sentiment = 'POSITIVE' if pred == 1 else 'NEGATIVE'
    conf = prob[pred] * 100
    print(f'{review}')
    print(f'  → {sentiment} ({conf:.1f}%)\n')

🎬 Testing Reviews:

This movie was absolutely incredible! Best film ever!
  → POSITIVE (58.4%)

Terrible waste of time. Boring and poorly made.
  → NEGATIVE (82.8%)

Masterpiece! Brilliant acting and story!
  → POSITIVE (61.2%)



## Test on a 'copypasta' meme reviewing the video game, Hogwarts Legacy

In [None]:
# Triple quotes allow the review text to span several lines
my_review = """Hogwarts Legacy is a game where the unforgivable is made understandable, where the ghost gets going, and where ninety percent of all battles are fought using barrel based blunt force trauma. In this game, you play as Hogwarts Legacy who is given supernatural abilities by her tulpa and the unnaturally Korean janitor as she attends the most complex and deadly secondary school known to man. As the new student for the not-Unseen University, you're given the job of researching and containing supernatural history textbooks with minimal or no ministry oversight and what appears to be Praetorian Guard succession law. You assume this mantle to fight Ranrok, a red-tinted malevolent force of death, to stop him from breaching the Jedi Survivor room. Much like any problem, this can be solved with the power of love. But stronger than any love is our protagonist's extreme love for hurling barrels at literal breakneck velocity, and in the process of destroying her enemy, she conducts more barrelling than Napa Valley. Here at the Ministry of Magic, our exorcisms are performed with extreme prejudice. Hogwarts Legacy is one of the best looter shooters that I have played in years, and to explain why, we're going hog wild--so don't take this as legitimate consumer advice, because I am as qualified and intelligent as a Hogwarts Professor. Instead, I ask you to join me in untangling the thrilling combat, the stellar visuals, the unusual plot, and a user interface more malformed than Charles II. Now, this game is surprisingly simple. You’re rarely just in a room, shooting people, because that’s actually fun. The rest is a bunch of pre-animated takedowns, cinematic set pieces, following an NPC for six hours (please, Nicholas, speed up). Every shootout is up to you, and that's what makes the spectacle significantly more impressive than the Left 4 Dead 2 pre-rendered gas station explosion.
Sorry guys, I was filming for a video, “Can we, like, kick this dude?” In a way, this game is basically only possible because of modern CPUs. If I showed this to my father ten years ago, he would have an aneurysm. In those dark times, we were confined by Skyrim's weakness to only render ten thousand cheese wheels at once. You'll probably need a good computer to render it, since every Revelio displays at least fifty Merlin Trials upon casting. DLSS is also a miracle technology for when I don't want my computer to etch ones and zeros on a stone tablet. I highly recommend it. This aesthetic is very appealing because this game doesn't have the slave labor budget of Hogwarts’ kitchen; therefore, it sticks to relatively clean interiors that you subsequently trash like your local McDonald's bathroom. Just don't enter the abandoned shop. Our adventure takes place in a vast continent called Hogwarts Valley, where the Royal Ministry of MI6 Wizards conducts ethical experiments on the supernatural like sissy hypnosis. Oooooh, you want to hit the bell. It's a gateway between our world and the wizarding, which is why white people immediately colonized it. There is variety, but what is there is tailor-made to make you feel like a basilisk in an orphanage. Enemies explode into colored smoke, rooms glow with beautiful lighting across a mirror sheen, and you can throw everyone. You can even help the disabled… into the fucking ground. It's art direction that makes this game pop, and it's all about that nostalgia. So make sure you fill your PC with an adequate number of beans. You are the agent of destruction of the game and your computer, and the visuals only emphasize this. But it would be nothing without that sick gameplay. Battlefield Four is a great-looking game, but I struggle to theorize how the AI climbed out of the abortion bin. Fortunately, we're playing a game that has more than two buttons, so it's time to explain Lesson One in Video James: The Buttons.
There's the aforementioned Mach Seven barrel toss, but that's not all. You are forbidden to fly in this game. Do you know how many options that would unlock? The paint buckets think they're safe, and suddenly this bastard's leaping directly among them. There are some fancy, less cool abilities like a teleport, a shield, and the ability to transfigure people into explosives. It’s for the greater good. But what about the spells? Doesn't Team Fortress Two have spells? Well, the emphasis isn't really on the spells. You have the generic spells, maybe a fireball, or maybe you change class to Cleric and die anyways because you suck. And then you have Venomous Tentacula, which vomits acid into the enemy at Mach Seven. No point in using anything else anymore; you just have to join Garden Warfare. Shut the up, Crazy Dave. You can use your spells on basic enemies, but there is intense variety with enemies that have shields, enemies with more health, and enemies with both. But the most threatening enemies are capable of harnessing the strongest powers in the entire game: Mother Nature. Diffindo is a scratch and Incendio is laughable compared to the sheer and comprehensively damaging effects of the amphibian attack. They are genuinely serious threats because they can survive your strongest attack, which is also nature. Le- let’s get back to that later. But by far, the most engaging enemy in the game is the [Devil’s snare incomprehensibly screeching] because you can’t stop them. These all culminate to form a chaotic, fun mess of utter destruction with traits, upgrades, and talent unlock-. Okay, I lied to you; we're actually doing this now. This video is being hijacked by the bad fairy. The spectacle of the game and the first-time experience is amazing, downright fantastic, but there is a serious problem: the single and only tactic the player can do requires a return to monke. Throwing cabbages is objectively, always, in every scenario, the best strategy possible.
Why do you think all my footage is just throwing cabbages? There are talents, but herbology is objectively the best talent. So why do anything else? All other tactics are effectively LARPing that you want to win, because of all things, Hufflepuff gaming is how you win every fight every time. This will grate you by the end, and it makes repeated playthroughs repetitious. But Max0r, why can't you just use a different strategy if throwing cabbages gets so boring? Well, this wouldn't be a problem in a normal game designed by a non-Anglican ethnicity. However, this is an enemy at the beginning of the game, and this is an enemy near the end. I fully upgraded this hat; effectively, enemies scale faster than you can upgrade your gear, which makes you weaker and more pathetic by the end of the game. There is, however, one strategy that actually gets more effective as you level it. Can you guess what it is? To get the exact same dopamine as you did at the beginning, you have to use cabbage or you're literally gimping yourself. It's like going to the zoo but you're told the only way to see other animals is to jump into the chimp exhibit. I should be allowed to feel more powerful as the game goes on, especially if you fill the game with literal RPG mechanics. Do not get me started on the traits--ninety percent of them are trash, like 'Curse target on slow' or 'Explode five percent more.' 'Get five percent more money on Christmas if it’s a Tuesday.” On my first playthrough, I went the entire game without finding a single Herbology trait. The grind has literally no relationship to success whatsoever, so consider ignoring it. What can't be ignored is our motivation. Why are we here? Where is the nearest Walmart? How to evade Aurors after create bomb? These answers and more can be found in the lore--a strange tale of a janitor and a whore. I love this game for the experience that it gave me once, and the story is a huge part of that.
Hogwarts is a building out of time, shifting as you progress, a barrier between us and the cosmos. We appear to be under the influence of terrifying cosmic beings that control its operations called The Keepers. These beings are inhuman, non-Euclidean voices originating from a gigantic inverted portrait in the Map Chamber. You need a goddamn thesaurus to describe them, and everyone just kind of listens to the commands for seemingly no reason. Sorry, Joe Biden. In real life, the painting gets the girl. In the midst of this, there's also a cosmic war going on between beings of sound. The first is Ranrok, a force of death which invades and occupies Hogwarts to spread his dominion. The other is Ancient Magic. Ancient Magic is trapped inside a spheroid. This necessitates the conscription of our protagonist to fight the cosmic war for them--at least we have a game. To make matters worse, there is yet another mysterious extra-dimensional presence lingering in the halls: the most powerful of them all, John Williams. It's a big plot point that he plays seven notes; that's how strong he is. The only issue is that this unfolds due to fetch quests exclusively. 'Oh, Jessie, we need the moonstone.' 'Go get the moonstone.' 'Oh, I gotta find Anne.' 'Oh boy, we're in Azkaban now.' 'Oh, it looks like Deek apperated by himself past me back to exactly where I began the quest,' causing all my time to be wasted. In reality, the plot is uniquely simple, and you could probably remove Ranrok from the story. There's this one moment where Ranrok expresses doubt for your cause--the Keepers killed Jackdaw, they attacked you, and then they made you Keeper within three seconds of walking in. The story is a struggle between a good side and an evil side. Every salient point that Ranrok makes is kind of shrugged off as him being just crazy, and Jesse's just like, 'Oh man, you know the guys who killed Jackdaw? I'll work for them without questioning that.'
Almost had nuance or moral complexity there--good thing that I *whew* leviosa-ed over it. Now all I have to do is put Lodgok’s brother in a coma. Stop screaming. This is all slightly offset by the presence of one Professor Garlick, whose… personality is so strong I can only assume the game was built around her. When I became tired of the Ranrok saga, Garlick was there dancing on a projector in my mind palace. I found myself drawn to her, loving every second of her performance, and I think I know why: The protagonist has the personality of a fish. She could suck the rainbow out of a pride parade. The books try to help with this, but no, they don't. When I review novelizations, they're often bombastic, new, and interesting, but Hogwarts Legacy expansions suffer from the malaise of the late game. They typically drag on and I'm just bored. They even got Newt Scamander in the movie where they cause World War Two. That's supposed to be a fun sentence, but the writing is identical. The Harry Potter Saga tries to shake things up by adding a new enemy, but instead of blending it into the combat, the book sends a babbling bumbling band of baboons to fight you all at once. POV: you are a minor on Discord. With all this in mind, I can't really recommend that you play Hogwarts- WRONG! That's right, this game is making a comeback. Sometimes entire games can be boiled down to just one moment--one pure example that stays with you forever and makes the experience truly worth playing. For me, there is nothing more emblematic of this than Hedwig’s Theme. So if you're interested in playing Hogwarts Legacy, here is the magic word: this is when the game dives headlong into the absurdity, the action, and everything else that makes this game actually good. And if you don't enjoy Hogwarts Legacy, at least you can take away this: every mechanic, every encounter was all building up to the moment that you cast Revelio and let it play. And yes, this music is in the game.
The game might not be worth playing, but Hedwig’s Theme is a recommend. I would like to thank the kind and truthful members of the royal government for adequately funding my clandestine operations in Albania. If you'd like to contribute towards the understanding of the supernatural objects that are my videos, you can head to my Patreon to learn more. Thank you all for watching and screaming, and of course, mermaids are real."""

pred, prob = predict_sentiment(my_review, nb_tfidf)
sentiment = 'POSITIVE' if pred == 1 else 'NEGATIVE'
print(f'Your review: {my_review}')
print(f'Prediction: {sentiment} ({prob[pred]*100:.1f}%)')

Your review: Hogwarts Legacy is a game where the unforgivable is made understandable, where the ghost gets going, and where ninety percent of all battles are fought using barrel based blunt force trauma. In this game, you play as Hogwarts Legacy who is given supernatural abilities by her tulpa and the unnaturally Korean janitor as she attends the most complex and deadly secondary school known to man. As the new student for the not-Unseen University, you're given the job of researching and containing supernatural history textbooks with minimal or no ministry oversight and what appears to be Praetorian Guard succession law. You assume this mantle to fight Ranrok, a red-tinted malevolent force of death, to stop him from breaching the Jedi Survivor room. Much like any problem, this can be solved with the power of love. But stronger than any love is our protagonist's extreme love for hurling barrels at literal breakneck velocity, and in the process of destroying her enemy, she conducts more

## Generate New Text Based on N-grams
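
The generator in the next cell rests on one idea: for every run of `n` characters, record which character followed it and how often. The sketch below shows just that counting step on a toy string; the variable names are illustrative assumptions and are not the ones used by the class.

```python
from collections import Counter, defaultdict

text = 'abracadabra'   # toy training text
n = 2                  # prefix length in characters

# Map each n-character prefix to a Counter of the characters that follow it
next_char_counts = defaultdict(Counter)
for i in range(len(text) - n):
    prefix = text[i:i + n]
    next_char_counts[prefix][text[i + n]] += 1

for prefix, followers in next_char_counts.items():
    print(prefix, dict(followers))
```

Generating text then amounts to repeatedly sampling a follower for the current prefix in proportion to these counts.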

In [None]:
import random
from collections import defaultdict

class NGramTextGenerator:
    """
    A simple N-gram based text generator that uses character-level N-grams
    to predict the next character based on a given prefix.
    This is a basic Markov-chain model: the next character is sampled according to
    how often it followed the prefix in the training text.
    """
    def __init__(self, n=3):
        self.n = n  # N-gram size
        self.ngrams = defaultdict(lambda: defaultdict(int)) # Stores counts of next characters for each n-gram
        self.ngram_counts = defaultdict(int) # Stores total counts for each n-gram

    def fit(self, text):
        """
        Trains the model on the input text by building n-gram counts.
        """
        if len(text) <= self.n:
            print(f"Warning: Input text is too short for n={self.n}. Consider a smaller n-gram size or longer text.")
            return

        for i in range(len(text) - self.n):
            prefix = text[i : i + self.n]
            next_char = text[i + self.n]
            self.ngrams[prefix][next_char] += 1
            self.ngram_counts[prefix] += 1

    def _get_next_char(self, prefix):
        """
        Predicts the next character based on the given prefix (n-gram).
        Samples in proportion to how often each character followed the prefix in the training data.
        """
        if prefix not in self.ngrams or not self.ngrams[prefix]:
            # Unseen prefix: fall back to a random observed character.
            # A more robust solution would involve back-off or smoothing.
            return self._get_next_char_fallback()

        possible_next_chars = self.ngrams[prefix]

        # Sample one character, weighted by its observed frequency
        chars = list(possible_next_chars.keys())
        weights = list(possible_next_chars.values())
        return random.choices(chars, weights=weights, k=1)[0]

    def _get_next_char_fallback(self):
        """
        Fallback for when no next character can be predicted from the model.
        Picks a random character from the entire set of observed characters or common ones.
        """
        all_possible_chars = set()
        for next_chars_dict in self.ngrams.values():
            all_possible_chars.update(next_chars_dict.keys())
        if all_possible_chars:
            return random.choice(list(all_possible_chars))
        return random.choice('abcdefghijklmnopqrstuvwxyz ')

    def generate(self, seed_text, length=100):
        """
        Generates new text starting with the seed_text.
        """
        generated_text = list(seed_text.lower())
        current_prefix = generated_text[-(self.n):] # Get the last n characters as prefix

        for _ in range(length - len(seed_text)):
            if len(current_prefix) < self.n:
                # If seed_text is shorter than n, we need to pad or use a different strategy
                # For now, we'll just pick a random character until prefix is long enough
                next_char = self._get_next_char_fallback()
            else:
                next_char = self._get_next_char("".join(current_prefix))

            generated_text.append(next_char)
            current_prefix = generated_text[-(self.n):] # Update the prefix

        return "".join(generated_text)

# --- Example Usage ---

# 1. Define the input string (the generator mimics its style based on n-gram frequency)
# Adjacent string literals are concatenated first, so `* 5` repeats the whole passage five times.
input_string = (
    "The quick brown fox jumps over the lazy dog. "
    "A stitch in time saves nine. "
    "Never underestimate the power of a good book. "
    "The early bird catches the worm. "
    "All that glitters is not gold. "
    "To be or not to be, that is the question. " * 5
)

# 2. Initialize the generator with an n-gram size (e.g., 3 for trigrams)
# A higher 'n' means the generated text will be more faithful to the input, but less creative.
# A lower 'n' means more creativity but potentially less coherence.
generator = NGramTextGenerator(n=3)

# 3. Train the generator with your input string
print("Training the model...")
generator.fit(input_string.lower()) # Convert to lowercase for consistent training
print("Model trained.\n")

# 4. Generate a new string
# Provide a seed text to start the generation. It should be at least 'n' characters long.
seed = "the quick brown"
generated_output = generator.generate(seed, length=150)

print(f"Seed: '{seed}'")
print(f"Generated text (length {len(generated_output)}):")
print(generated_output)


# --- Another Example ---
input_string_2 = ("""Hogwarts Legacy is a game where the unforgivable is made understandable, where the ghost gets going, and where ninety percent of all battles are fought using barrel based blunt force trauma. In this game, you play as John Legacy who is given supernatural abilities by her tulpa and the unnaturally Korean janitor as she attends the most complex and deadly secondary school known to man.
As the new student for the not-Unseen University, you're given the job of researching and containing supernatural history textbooks with minimal or no ministry oversight and what appears to be Praetorian Guard succession law. You assume this mantle to fight Ranrok, a red-tinted malevolent force of death, to stop him from breaching the Jedi Survivor room. Much like any problem, this can be solved with the power of love. But stronger than any love is our protagonist's extreme love for hurling barrels at literal breakneck velocity, and in the process of destroying her enemy, she conducts more barrelling than Napa Valley. Here at the Ministry of Magic, our exorcisms are performed with extreme prejudice.
Hogwarts Legacy is one of the best looter shooters that I have played in years, and to explain why, we're going hog wild--so don't take this as legitimate consumer advice, because I am as qualified and intelligent as a Hogwarts Professor. Instead, I ask you to join me in untangling the thrilling combat, the stellar visuals, the unusual plot, and a user interface more malformed than Charles II.
Now, this game is surprisingly simple. You’re rarely just in a room, shooting people, because that’s actually fun. The rest is a bunch of pre-animated takedowns, cinematic set pieces, following an NPC for six hours (please, Nicholas, speed up). Every shootout is up to you, and that's what makes the spectacle significantly more impressive than the Left 4 Dead 2 pre-rendered gas station explosion.
Sorry guys, I was filming for a video,
"Can we, like, kick this dude?"
In a way, this game is basically only possible because of modern CPUs. If I showed this to my father ten years ago, he would have an aneurysm. In those dark times, we were confined by Skyrim's weakness to only render ten thousand cheese wheels at once. You'll probably need a good computer to render it, since every Revelio displays at least fifty Merlin Trials upon casting. DLSS is also a miracle technology for when I don't want my computer to etch ones and zeros on a stone tablet. I highly recommend it.
This aesthetic is very appealing because this game doesn't have the slave labor budget of Hogwarts' kitchen; therefore, it sticks to relatively clean interiors that you subsequently trash like your local McDonald's bathroom. Just don't enter the abandoned shop. Our adventure takes place in a vast continent called Hogwarts Valley, where the Royal Ministry of MI6 Wizards conducts ethical experiments on the supernatural like sissy hypnosis.
Oooooh, you want to hit the bell.
It's a gateway between our world and the wizarding, which is why white people immediately colonized it. There is variety, but what is there is tailor-made to make you feel like a basilisk in an orphanage. Enemies explode into colored smoke, rooms glow with beautiful lighting across a mirror sheen, and you can throw everyone. You can even help the disabled... into the fucking ground. It's art direction that makes this game pop, and it's all about that nostalgia. So make sure you fill your PC with an adequate number of beans. You are the agent of destruction of the game and your computer, and the visuals only emphasize this.
But it would be nothing without that sick gameplay. Battlefield Four is a great-looking game, but I struggle to theorize how the AI climbed out of the abortion bin. Fortunately, we're playing a game that has more than two buttons, so it's time to explain Lesson One in Video James: The Buttons. There's the aforementioned Mach Seven barrel toss, but that's not all. You are forbidden to fly in this game. Do you know how many options that would unlock? The paint buckets think they're safe, and suddenly this bastard's leaping directly among them. There are some fancy, less cool abilities like a teleport, a shield, and the ability to transfigure people into explosives.
It's for the greater good.
But what about the spells? Doesn't Team Fortress Two have spells? Well, the emphasis isn't really on the spells. You have the generic spells, maybe a fireball, or maybe you change class to Cleric and die anyways because you suck. And then you have Venomous Tentacula, which vomits acid into the enemy at Mach Seven. No point in using anything else anymore; you just have to join Garden Warfare. Shut the up, Crazy Dave. You can use your spells on basic enemies, but there is intense variety with enemies that have shields, enemies with more health, and enemies with both. But the most threatening enemies are capable of harnessing the strongest powers in the entire game: Mother Nature. Diffindo is a scratch and Incendio is laughable compared to the sheer and comprehensively damaging effects of the amphibian attack. They are genuinely serious threats because they can survive your strongest attack, which is also nature.
Le- let's get back to that later.
But by far, the most engaging enemy in the game is the [Devil's snare incomprehensibly screeching] because you can't stop them.
These all culminate to form a chaotic, fun mess of utter destruction with traits, upgrades, and talent unlock-. Okay, I lied to you; we're actually doing this now. This video is being hijacked by the bad fairy. The spectacle of the game and the first time experience is amazing, downright fantastic, but there is a serious problem: the single and only tactic the player can do requires a return to monke. Throwing cabbages is objectively, always, in every scenario, the best strategy possible. Why do you think all my footage is just throwing cabbages? There are talents, but herbology is objectively the best talent. So why do anything else? All other tactics are effectively LARPing that you want to win, because of all things, Hufflepuff gaming is how you win every fight every time. This will grate you by the end, and it makes repeated playthroughs repetitious.
But Max0r, why can't you just use a different strategy if throwing cabbages gets so boring? Well, this wouldn't be a problem in a normal game designed by a non-Anglican ethnicity. However, this is an enemy at the beginning of the game, and this is an enemy near the end. I fully upgraded this hat. effectively, enemies scale faster than you can upgrade your gear, which makes you weaker and more pathetic by the end of the game. There is, however, one strategy that actually gets more effective as you level it. Can you guess what it is? To get the exact same dopamine as you did at the beginning, you have to use cabbage or you're literally gimping yourself. It's like going to the zoo but you're told the only way to see other animals is to jump into the chimp exhibit. I should be allowed to feel more powerful as the game goes on, especially if you fill the game with literal RPG mechanics. Do not get me started on the traits - ninety percent of them are trash, like "Curse target on slow" or "Explode five percent more." "Get five percent more money on Christmas if it's a Tuesday." On my first playthrough, I went the entire game without finding a single Herbology trait. The grind has literally no relationship to success whatsoever, so consider ignoring it.
What can't be ignored is our motivation. Why are we here? Where is the nearest Walmart? How to evade Aurors after create bomb? These answers and more can be found in the lore - a strange tale of a janitor and a whore. I love this game for the experience that it gave me once, and the story is a huge part of that.
Hogwarts is a building out of time, shifting as you progress, a barrier between us and the cosmos. We appear to be under the influence of terrifying cosmic beings that control its operations called The Keepers. These beings are inhuman, non-Euclidean voices originating from a gigantic inverted portrait in the Map Chamber. You need a goddamn thesaurus to describe them, and everyone just kind of listens to the commands for seemingly no reason. Sorry, Joe Biden. In real life, the painting gets the girl.
In the midst of this, there's also a cosmic war going on between beings of sound. The first is Ranrok, a force of death which invades and occupies Hogwarts to spread his dominion. The other is Ancient Magic. Ancient Magic is trapped inside a spheroid. This necessitates the conscription of our protagonist to fight the cosmic war for them - at least we have a game. To make matters worse, there is yet another mysterious extra-dimensional presence lingering in the halls: the most powerful of them all, John Williams. It's a big plot point that he plays seven notes. that's how strong he is.
The only issue is that this unfolds due to fetch quests exclusively. "Oh, Jessie, we need the moonstone." "Go get the moonstone." "Oh, I gotta find Anne." "Oh boy, we're in Azkaban now." "Oh, it looks like Deek apperated by himself past me back to exactly where I began the quest," causing all my time to be wasted. In reality, the plot is uniquely simple, and you could probably remove Ranrok from the story. There's this one moment where Ranrok expresses doubt for your cause - the Keepers killed Jackdaw, they attacked you, and then they made you Keeper within three seconds of walking in. The story is a struggle between a good side and an evil side. Every salient point that Ranrok makes is kind of shrugged off as him being just crazy, and Jesse's just like, "Oh man, you know the guys who killed Jackdaw? I'll work for them without questioning that." Almost had nuance or moral complexity there - good thing that I leviosa-ed over it. Now all I have to do is put Lodgok's brother in a coma.
Stop screaming.
This is all slightly offset by the presence of one Professor Garlick, whose... personality is so strong I can only assume the game was built around her. When I became tired of the Ranrok saga, Garlick was there dancing on a projector in my mind palace. I found myself drawn to her, loving every second of her performance, and I think I know why: The protagonist has the personality of a fish. She could suck the rainbow out of a pride parade. The books try to help with this, but no, they don't. When I review novelizations, they're often bombastic, new, and interesting, but Hogwarts Legacy expansions suffer from the malaise of the late game. They typically drag on and I'm just bored. They even got Newt Scamander in the movie where they cause World War Two. That's supposed to be a fun sentence, but the writing is identical. The Harry Potter Saga tries to shake things up by adding a new enemy, but instead of blending it into the combat, the book sends a babbling bumbling band of baboons to fight you all at once. POV: you are a minor on Discord.
With all this in mind, I can't really recommend that you play Hogwarts- WRONG! That's right, this game is making a comeback. Sometimes entire games can be boiled down to just one moment one pure example that stays with you forever and makes the experience truly worth playing. For me, there is nothing more emblematic of this than Hedwig's Theme. So if you're interested in playing Hogwarts Legacy, here is the magic word: this is when the game dives headlong into the absurdity, the action, and everything else that makes this game actually good. And if you don't enjoy Hogwarts Legacy, at least you can take away this: every mechanic, every encounter was all building up to the moment that you cast Revelio and let it play. And yes, this music is in the game.
The game might not be worth playing, but Hedwig's Theme is a recommend. I would like to thank the kind and truthful members of the royal government for adequately funding my clandestine operations in Albania. If you'd like to contribute towards the understanding of the supernatural objects that are my videos, you can head to my Patreon to learn more. Thank you all for watching and screaming, and of course, mermaids are real.
"""
    )

generator_2 = NGramTextGenerator(n=3)
generator_2.fit(input_string_2.lower())

seed_2 = "Hogwarts Legacy is a game"
generated_output_2 = generator_2.generate(seed_2, length=200)

print(f"\n\nSeed 2: '{seed_2}'")
print(f"Generated text 2 (length {len(generated_output_2)}):")
print(generated_output_2)

Training the model...
Model trained.

Seed: 'the quick brown'
Generated text (length 150):
the quick brown fox jumps over that is not glitters is nine. never the early bird catch in time saves not glitters is not to be or not to be or not go


Seed 2: 'Hogwarts Legacy is a game'
Generated text 2 (length 200):
hogwarts legacy is a game, new now. the game is make thin them.
the unlock? the is good smost the actic, every fathere aurus they typicall slowings of bablem with alband thanger to superier of deadly 
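
The second sample drifts into non-words ("smost", "actic"), which suggests the generator works on characters with a short context window. The sketch below illustrates that idea under stated assumptions: the context is taken to be the previous `n - 1` characters, and `build_char_model` and `generate_chars` are hypothetical helper names, not the notebook's `NGramTextGenerator`, whose internals may differ.

```python
import random
from collections import defaultdict


def build_char_model(text, n):
    """Map each (n-1)-character context to the characters seen right after it.

    Assumes n >= 2 (hypothetical helper, not the notebook's class).
    """
    model = defaultdict(list)
    for i in range(len(text) - n + 1):
        context = text[i:i + n - 1]
        next_char = text[i + n - 1]
        model[context].append(next_char)
    return model


def generate_chars(model, seed, length, n, rng):
    """Extend the lowercased seed one character at a time up to `length` characters."""
    out = seed.lower()
    while len(out) < length:
        context = out[-(n - 1):]
        choices = model.get(context)  # .get avoids creating empty entries
        if not choices:
            break  # unseen context: stop early rather than guess
        out += rng.choice(choices)
    return out


# Compare a short and a longer context on the same small corpus.
corpus = "to be or not to be, that is the question. " * 3
rng = random.Random(0)  # fixed seed so runs are repeatable
for n in (3, 6):
    model = build_char_model(corpus, n)
    print(n, repr(generate_chars(model, "to be or", 60, n, rng)))
```

With a longer context the samples tend to copy longer stretches of the training text verbatim, while a shorter context mixes fragments more freely and produces more invented words.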
