In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("gla14.ipynb")

<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Guided Learning Activity 14: Bayes' Rule

This Guided Learning Activity is designed for you to complete alongside a Data Ambassador from the course. You might find that it feels like a combination of the lectures and lab assignment. Whether you are participating live or watching the recording of the live meeting, let the Data Ambassador guide you through the following tasks. There will be moments for you to reflect and explore your own ideas as a way to solidify concepts and skills introduced by your instructor. Keep in mind that this is not a graded assignment for MATH 108 by default. If you have any concerns about participation, reach out to your instructor.

---

## Learning Objectives

1. Review fundamentals of probability theory.
2. Outline Bayes' Rule.
3. Calculate probability values using various probability rules (including Bayes' Rule).
4. Connect probability theory with generative AI.
5. Create a sentence generator using prior probabilities.
6. Consider how to extend the generator by incorporating a sentence review.

---

## Configure the Notebook

Run the following code cell to set up the notebook.

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Fundamentals of Probability

---

### Probability

- **Definition**:  
  A numerical measure of the chance that an event will occur.
- **Event**:  
  A collection of one or more possible outcomes.
- **Probability Values**:  
  Range from **0 (impossible)** to **1 (certain)**, or **0% to 100%**.

---

### Interpretations of Probability

#### Theoretical Probability

- Based on logical analysis of all possible outcomes.
- **Formula**:  
  $$
  P(\text{event}) = \frac{\text{Number of favorable outcomes}}{\text{Total number of possible outcomes}}
  $$
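
As a quick illustration of the formula above, here is a minimal sketch that counts outcomes for a fair six-sided die. The die and the event "roll an even number" are assumptions chosen for illustration; they are not part of this activity's data.

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die
outcomes = [1, 2, 3, 4, 5, 6]

# Favorable outcomes for the event "roll an even number"
favorable = [o for o in outcomes if o % 2 == 0]

# P(event) = favorable outcomes / total possible outcomes
p_even = Fraction(len(favorable), len(outcomes))
print(p_even)  # 1/2
```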

---

#### Empirical (Objective): Frequentist Probability

- Based on observed **relative frequencies**.
- **Formula**:  
  $$
  P(\text{event}) = \frac{\text{Number of times event occurred}}{\text{Total number of trials}}
  $$
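
The relative-frequency idea can be sketched with a small simulation. This example assumes a fair six-sided die and 10,000 simulated rolls (both illustrative choices); the estimate should land close to the theoretical value of 0.5, though not exactly on it.

```python
import random

random.seed(0)  # fixed seed so the simulation is repeatable

trials = 10_000
# Hypothetical experiment: roll a fair six-sided die many times
rolls = [random.randint(1, 6) for _ in range(trials)]

# P(event) is estimated by (times the event occurred) / (total trials)
times_even = sum(1 for r in rolls if r % 2 == 0)
p_even_hat = times_even / trials
print(p_even_hat)
```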

---

#### Empirical (Subjective): Bayesian Probability

- Represents a **degree of belief**, updated with new evidence.
- Combines **prior knowledge, experience, and data**.
- Bayes' Rule is a way to **update probabilities when new information (evidence) is observed**.
$$
P(\text{event}\mid\text{evidence}) = \frac{P(\text{evidence}\mid \text{event}) \cdot P(\text{event})}{P(\text{evidence})}
$$

---

#### Components of Bayes' Rule

- $P(\text{event} \mid \text{evidence})$: **Posterior** (updated belief after seeing evidence)
- $P(\text{evidence} \mid \text{event})$: **Likelihood** (how likely is the evidence if the event is true?)
- $P(\text{event})$: **Prior** (initial belief before seeing the evidence)
- $P(\text{evidence})$: Evidence probability (overall chance of seeing the evidence)
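
To see how the four components fit together, here is a small worked sketch. The scenario (event: it rained overnight, evidence: the grass is wet) and every number in it are made up for illustration. The evidence probability is computed by adding up the two ways the evidence can occur: with the event and without it.

```python
# Hypothetical numbers, chosen only to illustrate the components
prior = 0.3                  # P(event): it rained overnight
likelihood = 0.9             # P(evidence | event): grass is wet if it rained
likelihood_if_not = 0.2      # P(evidence | not event): grass is wet anyway

# P(evidence): overall chance of wet grass, over both cases
evidence = likelihood * prior + likelihood_if_not * (1 - prior)

# Bayes' Rule: posterior = likelihood * prior / evidence
posterior = likelihood * prior / evidence
print(round(posterior, 3))
```

With these assumed values, seeing wet grass raises the belief in rain from the prior of 0.3 to a posterior of roughly 0.66.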

---

#### Key Idea of Bayes' Rule

- **Posterior $\propto$ Likelihood $\times$ Prior**
    - Start with a **prior belief**.
    - Gather **new evidence (likelihood)**.
    - Update your belief to get the **posterior**.
- This is how Bayesian probability allows us to **combine experience, prior knowledge, and data to make informed decisions.**
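
The proportionality above can be sketched with several competing hypotheses at once. In this example the hypothesis names `A`, `B`, `C` and all of the numbers are assumptions for illustration. Multiplying each likelihood by its prior gives unnormalized scores, and dividing by their total turns them into posterior probabilities that sum to 1.

```python
# Hypothetical priors and likelihoods for three competing hypotheses
priors = {'A': 0.5, 'B': 0.3, 'C': 0.2}
likelihoods = {'A': 0.1, 'B': 0.4, 'C': 0.8}  # P(evidence | hypothesis)

# Posterior is proportional to likelihood times prior
unnormalized = {h: likelihoods[h] * priors[h] for h in priors}

# Dividing by the total (which plays the role of P(evidence)) normalizes
total = sum(unnormalized.values())
posteriors = {h: unnormalized[h] / total for h in unnormalized}

for h, p in posteriors.items():
    print(h, round(p, 3))
```

Notice that hypothesis `C` starts with the smallest prior but ends with the largest posterior, because the assumed evidence is much more likely under `C`.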


---

### Task 01 📍🔎

<!-- BEGIN QUESTION -->

Suppose that we've observed a cat and collected the following data on its behavior over several days:

In [None]:
observations = Table.read_table('cat_behavior.csv')
observations

What is the probability that the cat's next behavior is `"slept on bed"`?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

### Task 02 📍🔎

<!-- BEGIN QUESTION -->

Given it's the **afternoon**, what is the probability that the cat's behavior is `"slept on bed"`?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

### Task 03 📍🔎

<!-- BEGIN QUESTION -->

Now, if you are told that **the cat is sleeping**, what is the chance that it is the afternoon?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

### Task 04 📍🔎

<!-- BEGIN QUESTION -->

Recalculate that same probability using Bayes' Rule. Since you have no concept of the time of day, assume that there is a 50/50 chance of it being the afternoon.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

### Task 05 📍🔎

<!-- BEGIN QUESTION -->

Suppose you learned that the cat was sleeping from a photo, and you noticed fairly bright daylight in it. Your instinct should tell you that the chance of it being the afternoon is no longer just 50/50. Using this new information, update the probability that it is the afternoon given that the cat is sleeping on the bed.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

## Sentence Generation

Once we understand Bayes' Rule, we can see how probabilistic thinking applies to tasks like generating sentences with AI models.

---

### Generative AI

- At a high level, when an AI model like ChatGPT, Claude, etc. generates a sentence, image, or other output, it's **predicting what is most likely to come next based on patterns it has seen before**.
- For generating the next word in a sentence, the model is essentially calculating:

$$
P(\text{next word} \mid \text{previous words}) = \frac{P(\text{previous words} \mid \text{next word}) \times P(\text{next word})}{P(\text{previous words})}
$$

- This is **Bayes' Rule**, applied to language:
    - The model updates its belief about the next word using both:
        - **Prior**: How often that word appears overall.
        - **Likelihood**: How well that word would explain the previous words.
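
The identity above can be checked on a tiny example. This sketch assumes a made-up list of (previous word, next word) pairs, with a single previous word standing in for "previous words". It computes the conditional probability directly from counts and again through Bayes' Rule, and the two results agree.

```python
from fractions import Fraction

# Hypothetical (previous word, next word) pairs, for illustration only
pairs = [('the', 'cat'), ('the', 'cat'), ('the', 'dog'), ('a', 'cat'),
         ('a', 'dog'), ('a', 'dog'), ('the', 'bird')]

n = len(pairs)
count_prev = sum(1 for prev, nxt in pairs if prev == 'the')
count_next = sum(1 for prev, nxt in pairs if nxt == 'cat')
count_both = sum(1 for prev, nxt in pairs if prev == 'the' and nxt == 'cat')

# Direct estimate of P(next = 'cat' | previous = 'the')
direct = Fraction(count_both, count_prev)

# The same quantity through Bayes' Rule
likelihood = Fraction(count_both, count_next)  # P(previous = 'the' | next = 'cat')
prior = Fraction(count_next, n)                # P(next = 'cat')
evidence = Fraction(count_prev, n)             # P(previous = 'the')
via_bayes = likelihood * prior / evidence

print(direct, via_bayes)  # 1/2 1/2
```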


---

### Task 06 📍

Data in the form of examples is at the heart of generative AI. For the rest of this activity, suppose that you only have access to the following sentences that are written in the form "The \[noun\] \[verb\] \[prepositional phrase\]."

In [None]:
sentences = Table.read_table('sentences.csv')
sentences

Complete the following function that extracts the noun, verb, and prepositional phrase from those simple sentences.

**Hint**: The `split` and `join` string methods were used early on in MATH 108.
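
As a refresher, here is a minimal sketch of how `split` and `join` behave on an unrelated string (the string itself is made up for illustration):

```python
text = 'red green blue yellow'

# split breaks a string into a list of pieces at each separator
words = text.split(' ')
print(words)   # ['red', 'green', 'blue', 'yellow']

# join glues a list of strings back together with a separator
joined = '-'.join(words[:2])
print(joined)  # red-green
```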

In [None]:
def noun_verb_prep(sentence):
    '''
    Extracts the noun, verb, and prepositional phrase from a simple sentence.

    Assumes the sentence has the format: 'The [noun] [verb] [prepositional phrase].'
    Returns a tuple: (noun, verb, prepositional phrase as a string).
    '''
    sentence_no_punc = sentence[:-1] # Removes the ending punctuation
    split_sentence = ...
    noun = ...
    verb = ...
    prep = ...
    return noun, verb, prep

# Test the function: It should return 'cat', 'jumped', 'onto the table'
noun_verb_prep('The cat jumped onto the table.')

In [None]:
grader.check("task_06")

---

### Task 07 📍

Create a table called `deconstructed` that is based on the `sentences` table and has the following 3 columns:
- `'noun'`: The noun from the relevant sentence in the `sentences` table
- `'verb'`: The verb from the relevant sentence in the `sentences` table
- `'prep_phrase'`: The prepositional phrase from the relevant sentence in the `sentences` table

**Hint**: Use `noun_verb_prep` and build up the requested table from a row perspective.

In [None]:
rows = ...
deconstructed = Table(['noun', 'verb', 'prep_phrase']).with_rows(...)
deconstructed

In [None]:
grader.check("task_07")

---

### Task 08 📍

Write a function `generate_verb` that generates the next verb in a sentence given a noun, using the observed data (a table like `deconstructed`) to calculate the likelihood (conditional probability) of the verb given the noun.

- Filter the table for the given noun.
- Calculate the probability of each verb appearing with that noun.
- Randomly choose a verb using the calculated probabilities.
- If the noun is not in the table, just pick a random verb from the table.


**Hint**: You can use `.group()` and add a probability column by dividing the counts by the total.

**Note**: The function `np.random.choice` has an optional parameter `p`, which is an array of probabilities. It lets the function randomly select an item from the list with weighted likelihoods, meaning items with higher probabilities are more likely to be chosen.
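
Here is a minimal sketch of weighted sampling with `np.random.choice`. The items and probabilities are made up for illustration and are unrelated to the sentence data. The probabilities in `p` must sum to 1 and line up, position by position, with the items.

```python
import numpy as np

np.random.seed(42)  # fixed seed so the draws are repeatable

# Hypothetical items and weights
colors = np.array(['red', 'green', 'blue'])
probs = np.array([0.7, 0.2, 0.1])

# One weighted draw
one_draw = np.random.choice(colors, p=probs)

# Many weighted draws: 'red' should show up in roughly 70% of them
draws = np.random.choice(colors, size=1000, p=probs)
share_red = np.mean(draws == 'red')
print(one_draw, share_red)
```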

In [None]:
def generate_verb(noun, deconstructed):
    '''
    Generate a verb for the given noun using the conditional probability
    P(verb | noun) based on the deconstructed table.
    If the noun is not found, returns a random verb from the table.
    '''
    filtered = ...
    
    if filtered.num_rows == 0:
        return ...
    else:
        verb_counts = ...
        verb_counts = ...
        return ...

# Test the function. 
generate_verb('cat', deconstructed)

In [None]:
grader.check("task_08")

---

### Task 09 📍

Write a function `generate_prep_phrase` that predicts the prepositional phrase given the noun and the predicted verb.

- Filter the table for the given noun and verb.
- Calculate the probability of each prepositional phrase appearing with that noun and verb.
- Randomly choose a prepositional phrase using the calculated probabilities.
- If the noun and verb combination is not in the table, just pick a random prepositional phrase.

**Hint**: This is similar to the previous function.

In [None]:
def generate_prep_phrase(noun, verb, deconstructed):
    '''
    Predicts a prepositional phrase for the given noun and verb using
    P(prep_phrase | noun, verb). If not found, returns a random prep phrase.
    '''
    filtered = ...
    
    if filtered.num_rows == 0:
        return ...
    else:
        prep_counts = ...
        prep_counts = ...
        return ...

# Test the function. It should return only a prepositional phrase
# (for example, the 'for the afternoon' part of 'The cat slept for the afternoon.')
generate_prep_phrase('cat', 'slept', deconstructed)

In [None]:
grader.check("task_09")

---

### Task 10 📍

Write a function `generate_sentence_prob` that takes a noun and uses your two functions to generate a sentence in the format: "The \[noun\] \[verb\] \[prepositional phrase\]."

- First, use your `generate_verb` function to predict the verb based on the noun.
- Next, use your `generate_prep_phrase` function to predict the prepositional phrase based on the noun and verb.
- Return the full sentence.

In [None]:
def generate_sentence_prob(noun, deconstructed):
    '''
    Generates a full sentence:
    'The [noun] [verb] [prep_phrase].'
    by using generate_verb and generate_prep_phrase.
    '''
    verb = ...
    prep_phrase = ...
    return f"The {noun} {verb} {prep_phrase}."

# Test the function.
generate_sentence_prob('teacher', deconstructed)

In [None]:
grader.check("task_10")

---

### Task 11 📍🔎

<!-- BEGIN QUESTION -->

Since `'teacher'` is in the provided data table, this function is not generating a sentence without any context. It uses the established patterns to guess a likely sentence based on the sentences it has seen so far. Call the function a few times. You should see some sentences that don't make sense.

In [None]:
generate_sentence_prob('teacher', deconstructed)

Why do you think that happens? How might this connect to what large language models like ChatGPT, Claude, etc. do when they "hallucinate"?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

### Task 12 📍🔎

<!-- BEGIN QUESTION -->

How might human feedback (like corrections or ratings) be used to adjust the likelihoods in models like this one and improve the accuracy of future predictions? In other words, how can feedback help update the probabilities the model uses to generate better sentences?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

## Reflection

In this activity, you reviewed some of the basic ideas behind probability and focused on conditional probability and Bayes' Rule. You saw how probability theory is a core component of generative AI, and you built a sentence generator that uses example sentences and probability theory. Lastly, you reflected on how a review process for generated sentences could update the likelihoods your functions use and improve the sentences they produce.

---

## License

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a>.

<img src="./by-nc-sa.png" width=100px>