# Dataset Review and Training Preprocessing

This notebook walks you through reviewing the dataset and preparing it for training.

After completing it, you should understand the basics of:

1. **Pandas** — This Python library is often used in machine learning to load,
   explore, clean, and transform data.

2. **One-Hot Encoding** — a critical technique for converting categorical data
   (like chords) into numbers that neural networks can process.

We'll learn both by exploring a real dataset of chord progressions from
Billboard chart songs. Along the way, you'll complete exercises that require
you to write code and think carefully about what you're seeing.

In [None]:
import torch
import pandas as pd
from collections import Counter


def check(my_answer, correct):
    """Check your answer against the correct one."""
    if isinstance(my_answer, torch.Tensor):
        my_answer = my_answer.tolist()
    if isinstance(correct, torch.Tensor):
        correct = correct.tolist()

    if my_answer == correct:
        print("✓ Correct!")
    else:
        print(f"✗ Not quite. Expected: {correct}")

---
## Part 1: Introduction to Pandas

**Pandas** is a Python library for working with structured data. The core
object is the **DataFrame** — a 2D table with labeled rows and columns.

Think of a DataFrame like a spreadsheet: rows are records, columns are fields.
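
As a minimal sketch of that spreadsheet picture, a DataFrame can also be built by hand from a dictionary of columns. The three rows below are hypothetical placeholders, not rows from the Billboard file; only the column names mirror the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical placeholder rows; only the column names mirror the real dataset.
toy = pd.DataFrame(
    {
        "title": ["Song A", "Song B", "Song C"],
        "artist": ["Artist X", "Artist Y", "Artist X"],
        "chords": ["I|IV|V|I", "vi|IV|I|V", "I|V|vi|IV"],
    }
)

print(toy)                # rows are records, columns are fields
print(toy.shape)          # (3, 3): three rows, three columns
print(list(toy.columns))  # ['title', 'artist', 'chords']
```

Each dictionary key becomes a column, and each position in the lists becomes a row.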

In [None]:
# https://mtec345.vercel.app/resources/billboard_numerals_simple.csv
df = pd.read_csv("./billboard_numerals_simple.csv")

# Display the first few rows
df.head()

The `head()` method shows the first 5 rows by default. You can pass a number
to see more or fewer rows:

- `df.head(10)` — first 10 rows
- `df.tail(3)` — last 3 rows

In [None]:
print(f"Shape: {df.shape}")  # (rows, columns)
print(f"Columns: {list(df.columns)}")
print(f"Data types:\n{df.dtypes}")

### Key DataFrame Properties

| Property | Description | Example |
|----------|-------------|---------|
| `df.shape` | Tuple of (rows, columns) | `(220, 3)` |
| `df.columns` | Index of column names | `['title', 'artist', 'chords']` |
| `df.dtypes` | Data type of each column | `object` means text/string |
| `len(df)` | Number of rows | `220` |

In [None]:
# Experiment with the properties above to get familiar with the data:
print(f"Shape: {df.shape}")
print(f"Number of songs: {len(df)}")
print(f"Columns: {list(df.columns)}")

---
## Part 2: Selecting Data in Pandas

Pandas offers several ways to access data:

| Syntax | What it does | Returns |
|--------|--------------|---------|
| `df["column"]` | Select one column | Series |
| `df[["col1", "col2"]]` | Select multiple columns | DataFrame |
| `df.iloc[0]` | Select row by position (index) | Series |
| `df.iloc[5:10]` | Select rows 5-9 | DataFrame |
| `df.loc[df["col"] == "x"]` | Select rows by condition | DataFrame |

In [None]:
titles = df["title"]
print(type(titles))  # pandas Series
titles.head()

In [None]:
subset = df[["title", "artist"]]
print(type(subset))  # pandas DataFrame
subset.head()

In [None]:
first_song = df.iloc[0]
print(f"Type: {type(first_song)}")
first_song

In [None]:
df.iloc[10:15]  # Rows 10-14

### Your Turn: Explore the Data

Use the selection methods above to browse the dataset. Try things like:

| Syntax | What it does |
|--------|--------------|
| `df.iloc[42]` | Look at a different song |
| `df.iloc[-1]` | Return the last row |
| `df.iloc[-2]` | Return the second-to-last row |
| `df["artist"].head(20)` | See more artists |
| `df.tail(10)` | See the last 10 rows |
| `df.iloc[10:15]` | Rows 10-14 |
| `df["artist"].value_counts().head(10)` | Can you figure this out? |

In [None]:
# Try different selections to get a feel for the data:
df.iloc[42]

In [None]:
# Create more cells to explore as needed
df["artist"].value_counts().head(15)

In [None]:
# OPTIONAL!
# This uses some advanced features that we have not discussed. 
# If you are feeling brave, try to figure out why this works
artist_vc = df["artist"].value_counts()
artists_with_one_hit = artist_vc[artist_vc == 1]
artists_with_one_hit.head(15)

---
## Part 3: Filtering and Aggregating

We can filter rows based on conditions using boolean indexing:

In [None]:
# Find all songs by a specific artist
stones_songs = df[df["artist"] == "The Rolling Stones"]
print(f"Found {len(stones_songs)} Rolling Stones songs")
stones_songs
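
To see why this kind of filter works, it can help to look at the intermediate value. The sketch below uses a hypothetical four-row frame (the titles and artists are placeholders) and splits the filter into two steps: building a True/False mask, then indexing with it.

```python
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame.
toy = pd.DataFrame(
    {
        "title": ["Song A", "Song B", "Song C", "Song D"],
        "artist": ["Artist X", "Artist Y", "Artist X", "Artist Z"],
    }
)

# Step 1: the comparison yields one True/False value per row.
mask = toy["artist"] == "Artist X"
print(mask.tolist())  # [True, False, True, False]

# Step 2: indexing with the mask keeps only the rows marked True.
matches = toy[mask]
print(matches["title"].tolist())  # ['Song A', 'Song C']

# Conditions combine with & (and) and | (or); each needs its own parentheses.
both = toy[(toy["artist"] == "Artist X") & (toy["title"] != "Song A")]
print(both["title"].tolist())  # ['Song C']
```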

In [None]:
# Use .str to access string methods on text columns
michael_songs = df[df["artist"].str.contains("Michael")]
print(f"Found {len(michael_songs)} songs with 'Michael' in artist name")
michael_songs

### Useful String Methods

| Method | Description |
|--------|-------------|
| `df["col"].str.contains("x")` | Check if contains substring |
| `df["col"].str.startswith("x")` | Check if starts with |
| `df["col"].str.lower()` | Convert to lowercase |
| `df["col"].str.split(",")` | Split by delimiter |
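
A short sketch of the four methods in the table, run on a small Series. The names are illustrative examples chosen for this sketch and are not taken from the dataset.

```python
import pandas as pd

# Example artist names, used only for illustration.
artists = pd.Series(["Michael Jackson", "The Beatles", "George Michael"])

print(artists.str.contains("Michael").tolist())  # [True, False, True]
print(artists.str.startswith("The").tolist())    # [False, True, False]
print(artists.str.lower().tolist())
print(artists.str.split(" ").tolist())

# contains() is case-sensitive unless case=False is passed.
print(artists.str.contains("michael").tolist())              # [False, False, False]
print(artists.str.contains("michael", case=False).tolist())  # [True, False, True]
```

The methods that return True/False values can be used directly as a filter, as in the `michael_songs` cell above.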

---
## Part 4: Working with the Chords Column

Our chords column contains strings like `"I|IV|V|I"` — chord symbols
separated by `|`. We need to **split** these strings into lists.

In [None]:
chords_before_split = df["chords"].iloc[0]
print("\nBefore splitting:", chords_before_split[0:36], "...\n")

chords_after_split = chords_before_split.split("|")
print("\nAfter splitting:", chords_after_split[:8], "...\n")  # show the beginning

### Your Turn: What Changed?

In a **markdown cell**, explain what the difference is between:

- The value **before** splitting
- The value **after** splitting

Your answer should mention:
- The **data type** (call `type(chords_before_split)` if you are unsure)
- The **advantage** of the new format

Write your answer in the next cell.


### Double click here to edit

Replace this text with your response.

In [None]:
# Split each chord string on "|" and keep the result in a variable.

chords_split = df["chords"].str.split("|")

print('\ndf["chords"].head():')
print(df["chords"].head())

The output of the previous cell shows `dtype: str` (older pandas versions show `dtype: object` instead). Interesting!

That means that `df["chords"]` still contains strings...

Do you understand why calling `df["chords"].str.split("|")` did not change the values in `df["chords"]` to lists?
 
Before reading the explanation below, write your best guess:
**Why didn't `df["chords"].str.split("|")` change the values inside `df["chords"]`?**


### Double click here to edit

Replace this text with your response.

### Separating Two Ideas

**Idea 1: Splitting**

- `df["chords"].str.split("|")` computes a *new* Series where each string
  becomes a list. It does not mutate the existing DataFrame.

**Idea 2: Assigning back into the DataFrame**

- `df["chords"] = ...` stores the result into the DataFrame column.
- This ability to use `=` to edit a column is part of what makes pandas useful.

We'll overwrite `df["chords"]` because the rest of the lab expects each row
to contain a list of chord symbols (not one long string).
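
The two ideas can be observed separately on a tiny frame. This is a minimal sketch with two hypothetical rows, not the real dataset.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real dataset.
toy = pd.DataFrame({"chords": ["I|IV|V|I", "vi|IV|I|V"]})

# Idea 1: splitting returns a new Series; `toy` itself is untouched.
split_result = toy["chords"].str.split("|")
print(type(toy["chords"].iloc[0]))  # <class 'str'>
print(type(split_result.iloc[0]))   # <class 'list'>

# Idea 2: only the assignment stores the lists in the DataFrame.
toy["chords"] = split_result
print(type(toy["chords"].iloc[0]))  # <class 'list'>
print(toy["chords"].iloc[0])        # ['I', 'IV', 'V', 'I']
```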

In [None]:
chords_split = df["chords"].str.split("|")  # a new Series of lists
df["chords"] = chords_split                 # store it back in the DataFrame

print("After assigning back into df['chords']:")
print(df["chords"].head())
print(f"\nType of first value: {type(df['chords'].iloc[0])}")

Now each cell contains a **list** of chord symbols instead of a single string.

The chords are written in **Roman numeral notation**, which describes chords
relative to the key of the song:

- **I** = the "one" chord (tonic major)
- **IV** = the "four" chord (subdominant)
- **V** = the "five" chord (dominant)
- **vi** = the "six" chord (relative minor)

This notation is key-independent.
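
To make "key-independent" concrete, here is a small sketch that converts chord roots to Roman numerals in two different keys. The lookup table and the `to_numerals` helper are hypothetical simplifications written for this illustration (major keys only, four chords only); they are not how the dataset was produced.

```python
# Semitone offset from the tonic -> Roman numeral (small major-key subset).
# A simplified illustration, not the method used to build the dataset.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
OFFSET_TO_NUMERAL = {0: "I", 5: "IV", 7: "V", 9: "vi"}


def to_numerals(key, chord_roots):
    """Convert chord roots to Roman numerals relative to a major key."""
    tonic = NOTE_NAMES.index(key)
    numerals = []
    for root in chord_roots:
        offset = (NOTE_NAMES.index(root) - tonic) % 12
        numerals.append(OFFSET_TO_NUMERAL[offset])
    return numerals


# The same progression played in two different keys...
in_c = to_numerals("C", ["C", "F", "G", "A"])  # C, F, G, A minor
in_g = to_numerals("G", ["G", "C", "D", "E"])  # G, C, D, E minor

# ...produces identical Roman numerals.
print(in_c)          # ['I', 'IV', 'V', 'vi']
print(in_g)          # ['I', 'IV', 'V', 'vi']
print(in_c == in_g)  # True
```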

**What are the advantages and disadvantages of this approach for machine learning?** 

Make a guess and put it in the next cell.


### Double click here to edit

Replace this text with your response.

In [None]:
sample = df.iloc[0]
print(f"Title: {sample['title']}")
print(f"Artist: {sample['artist']}")
print(f"Number of chords: {len(sample['chords'])}")
print(f"First 8 chords: {sample['chords'][:8]}")

### Your Turn: Exploring Chord Progressions

Let's practice accessing the chord data step by step.

In [None]:
# First, select a song using iloc
song = df.iloc[100]
print(f"Song: '{song['title']}' by {song['artist']}")

In [None]:
# The "chords" field is now a Python list
chord_list = song["chords"]
print(f"Type: {type(chord_list)}")
print(f"First few chords: {chord_list[:8]}")

In [None]:
# Remember: len() gives the length of any list
num_chords = len(chord_list)
print(f"This song has {num_chords} chords")

In [None]:
# Use list indexing: [0] for first, [-1] for last
print(f"First chord: {chord_list[0]}")
print(f"Last chord: {chord_list[-1]}")
print(f"Chords 4-8: {chord_list[4:8]}")

### Explore on Your Own

Pick a few songs and examine their chord progressions:
- Which song has the most chords?
- Do most songs start on the same chord?
- Do songs tend to end on the "I" chord?

In [None]:
# Try looking at different songs:
df.iloc[50]["chords"][:10]

---
## Part 5: Computing New Columns

We can create new columns by applying functions to existing ones.
Let's add a column for the number of chords in each song.
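
As a minimal sketch of what "applying a function" means, the example below compares `.apply(len)` with an explicit loop. The three chord lists are hypothetical.

```python
import pandas as pd

# Hypothetical rows whose "chords" values are already lists.
toy = pd.DataFrame(
    {"chords": [["I", "IV", "V", "I"], ["vi", "IV"], ["I", "V", "vi", "IV", "I"]]}
)

# .apply(len) calls len() once per row...
via_apply = toy["chords"].apply(len)

# ...which matches an explicit loop over the column.
via_loop = [len(chord_list) for chord_list in toy["chords"]]

print(via_apply.tolist())  # [4, 2, 5]
print(via_loop)            # [4, 2, 5]
```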

In [None]:
# .apply(len) runs len() on each row's chord list
df["num_chords"] = df["chords"].apply(len)
df[["title", "artist", "num_chords"]].head(10)

In [None]:
# Pandas makes it easy to compute summary statistics
print(f"Average chords per song: {df['num_chords'].mean():.1f}")
print(f"Minimum: {df['num_chords'].min()}")
print(f"Maximum: {df['num_chords'].max()}")
print(f"Median: {df['num_chords'].median()}")

In [None]:
# These return the INDEX of the max/min value
longest_song = df.loc[df["num_chords"].idxmax()]
print(f"Longest: '{longest_song['title']}' by {longest_song['artist']} ({longest_song['num_chords']} chords)")

shortest_song = df.loc[df["num_chords"].idxmin()]
print(f"Shortest: '{shortest_song['title']}' by {shortest_song['artist']} ({shortest_song['num_chords']} chords)")

In [None]:
# Which songs have more than 200 chords?
long_songs = df[df["num_chords"] > 200]
print(f"Songs with more than 200 chords: {len(long_songs)}")
long_songs[["title", "artist", "num_chords"]]

---
## Part 6: Building the Vocabulary

Before we can encode chords for a neural network, we need to know what
chords exist in our dataset. This set of all possible values is called
the **vocabulary**.

In [None]:
all_chords = []
for chord_list in df["chords"]:
    if chord_list:  # Skip empty lists
        all_chords.extend(chord_list)

print(f"Total chord occurrences: {len(all_chords)}")

In [None]:
unique_chords = sorted(set(all_chords))
print(f"Found {len(unique_chords)} unique chords:")
print(unique_chords)

Each unique chord is one category, so the vocabulary size is the number of categories.

Notice the mix of major (uppercase) and minor (lowercase):
- Major: I, IV, V, VI, VII, bVII
- Minor: i, ii, iv, vi


To train our first neural network, we are going to use a small vocabulary.

The original dataset includes chords with inversions and much more complicated
variations. You may want to use this more complex data in the future, but for
now we have simplified 7th chords and omitted many songs from the dataset that
have more complicated chords.
**That is why our dataset includes only 216 songs, while the original contains over 700.**

---
## Part 7: Chord Distribution

How often does each chord appear? This tells us about common patterns in
popular music.

In [None]:
chord_counts = Counter(all_chords)

print("Chord distribution (most common first):\n")
for chord, count in chord_counts.most_common():
    bar = "█" * (count // 500)  # Simple text bar chart
    print(f"  {chord:5s}: {count:5d} {bar}")

**What does this tell us?**

- **I** dominates — most songs emphasize the tonic chord
- **IV** and **V** are the next most common (the classic I-IV-V progression)
- Minor chords (i, vi, ii, iv) appear less frequently
- **bVII** and **VII** are relatively rare

This imbalance will affect our neural network — it will learn more about
common chords than rare ones.
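
One way to put a number on the imbalance is a majority-class baseline: a "model" that ignores its input and always answers with the most common chord. The sketch below uses a hypothetical, deliberately lopsided sample, not the real counts from the dataset.

```python
from collections import Counter

# Hypothetical, deliberately imbalanced sample (not the real counts).
toy_chords = ["I"] * 6 + ["IV"] * 2 + ["V"] * 1 + ["vi"] * 1
counts = Counter(toy_chords)

# A "model" that ignores its input and always answers with the top chord.
most_common_chord, most_common_count = counts.most_common(1)[0]
baseline_accuracy = most_common_count / len(toy_chords)

print(most_common_chord)           # I
print(f"{baseline_accuracy:.0%}")  # 60%
```

With imbalanced data, an accuracy score is easier to judge when it is compared against a baseline like this one.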

In 2–4 sentences:
- What chord(s) seem most common?
- How might this imbalance affect a model trained to predict the next chord?


### Double click here to edit

Replace this text with your response.

In [None]:
# What fraction of all chords is just "I"?
i_percentage = chord_counts["I"] / len(all_chords) * 100
print(f"The 'I' chord makes up {i_percentage:.1f}% of all chord occurrences!")

---
## Part 8: Why One-Hot Encoding?

Now we need to convert these chord symbols into numbers for our neural network.

**The naive approach**: assign each chord an integer.

In [None]:
naive_encoding = {chord: i for i, chord in enumerate(unique_chords)}
print("Naive integer encoding:")
for chord, idx in naive_encoding.items():
    print(f"  {chord} → {idx}")

**Why is this problematic?**

If we feed these integers directly into a linear layer, the model might learn
that:
- "I" (0) and "IV" (1) are "close" because 0 and 1 are close numbers
- "I" (0) and "vi" (9) are "far apart" because 0 and 9 are distant

But musically, this makes no sense! The relationship between chords isn't
captured by arbitrary index numbers.

**One-hot encoding** solves this by treating all categories as **equidistant**.
Each chord becomes a vector of zeros with a single 1 at its position.
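
The sketch below compares distances under the two encodings. It uses plain Python lists rather than tensors, and it assumes the vocabulary is the ten chords listed in Part 6 in sorted order; the helper names are hypothetical.

```python
import math

# Assumed vocabulary: the ten chords listed in Part 6, in sorted order.
CHORDS = ["I", "IV", "V", "VI", "VII", "bVII", "i", "ii", "iv", "vi"]


def one_hot(chord):
    """Return a plain-Python one-hot list for a chord."""
    return [1.0 if c == chord else 0.0 for c in CHORDS]


def int_distance(a, b):
    """Distance between two chords under integer encoding."""
    return abs(CHORDS.index(a) - CHORDS.index(b))


def one_hot_distance(a, b):
    """Euclidean distance between two chords under one-hot encoding."""
    return math.dist(one_hot(a), one_hot(b))


# Integer encoding: the distance depends on the arbitrary sort order.
print(int_distance("I", "IV"), int_distance("I", "vi"))  # 1 9

# One-hot encoding: every pair of different chords is sqrt(2) apart.
print(round(one_hot_distance("I", "IV"), 3))  # 1.414
print(round(one_hot_distance("I", "vi"), 3))  # 1.414
```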

### Think About It

In the naive encoding, "IV" is encoded as 1 and "V" as 2. A neural network
might learn these are "close" because 1 and 2 are close numbers.

Is this good or bad? Actually, IV and V *are* musically related — they're
both major chords commonly used together. So sometimes the naive encoding
creates useful relationships **by accident**.

The problem is that it also creates *meaningless* relationships. Is "I" (0)
really maximally different from "vi" (9)? Not musically!

**One-hot encoding** gives us a clean slate. All chords start equidistant,
and the network learns *actual* relationships from the data.

In your own words:
- Why can integer (naive) encoding be misleading for chord categories?
- What does one-hot encoding fix?


### Double click here to edit

Replace this text with your response.

---
## Part 9: Build Encoder/Decoder Functions

Let's build functions to convert between chord symbols and one-hot vectors.

In [None]:
CHORDS = sorted(set(all_chords))  # Our vocabulary
NUM_CHORDS = len(CHORDS)

# Use a Python "dictionary comprehension" to construct a mapping dict

stoi = {chord: i for i, chord in enumerate(CHORDS)}

# `stoi` is a Python dictionary that maps string => index

print(f"Vocabulary size: {NUM_CHORDS}\n")
stoi

In [None]:
def encode(chord):
    """One-hot encode a chord symbol."""
    index = stoi[chord]
    vector = torch.zeros(NUM_CHORDS)
    vector[index] = 1.0
    return vector


# Test it
print('encode("I") =', encode("I"))
print('encode("V") =', encode("V"))
print('encode("vi") =', encode("vi"))

In [None]:
def decode(vector):
    """Decode a one-hot vector back to a chord symbol."""
    index = torch.argmax(vector).item()
    return CHORDS[index]


# Test round-trip
original = "IV"
encoded = encode(original)
decoded = decode(encoded)
print(f"Original: {original}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")

### Your Turn

Predict what `encode("ii")` will produce:

In [None]:
my_guess = []  # FILL THIS IN - a list of 10 values (0s and 1s)

check(my_guess, encode("ii").tolist())

Now predict what `decode(torch.tensor([0, 0, 0, 0, 0, 0, 0, 1, 0, 0]))` returns:

In [None]:
my_guess = ""  # FILL THIS IN - a chord symbol like "I" or "vi"

check(my_guess, decode(torch.tensor([0, 0, 0, 0, 0, 0, 0, 1, 0, 0])))

---
## Part 10: Encode a Sequence

Neural networks often need to process **sequences** of chords, not just
single chords. Let's encode a chord progression.

In [None]:
progression = ["I", "IV", "V", "I"]

# Encode each chord
encoded_chords = [encode(c) for c in progression]

# Look at the individual encodings
for chord, vec in zip(progression, encoded_chords):
    print(f"{chord}: {vec}")

In [None]:
stacked = torch.stack(encoded_chords)
print(f"\nStacked shape: {stacked.shape}")
print(stacked)

This is a `(4, 10)` tensor — 4 chords, each with 10 values.

For a multi-layer perceptron (MLP), we usually want a **flat** 1D input.
We can flatten the matrix:

In [None]:
flattened = stacked.flatten()
print(f"Flattened shape: {flattened.shape}")
print(flattened)

Now our 4-chord progression is a single vector of **40 values** (4 × 10).

This is exactly what we'll feed into our neural network!

Explain (briefly) what `.flatten()` did here.

Your answer should mention:
- How the **shape** changed
- Why we might prefer a flat 1D vector for an MLP


### Double click here to edit

Replace this text with your response.

In [None]:
def encode_sequence(chords):
    """Encode a list of chords as a flat tensor."""
    encoded = [encode(c) for c in chords]
    return torch.stack(encoded).flatten()


# Test it
test_seq = ["I", "V", "vi", "IV"]
encoded = encode_sequence(test_seq)
print(f"Sequence: {test_seq}")
print(f"Encoded shape: {encoded.shape}")
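
Flattening does not lose information, because the vocabulary size tells us where each chord's block of values begins and ends. The sketch below adds a hypothetical `decode_sequence` helper (not part of this notebook) that reverses `encode_sequence`. It redefines the vocabulary so it runs on its own, assuming the ten chords listed in Part 6.

```python
import torch

# Assumed vocabulary: the ten chords listed in Part 6, in sorted order.
CHORDS = ["I", "IV", "V", "VI", "VII", "bVII", "i", "ii", "iv", "vi"]
NUM_CHORDS = len(CHORDS)
stoi = {chord: i for i, chord in enumerate(CHORDS)}


def encode_sequence(chords):
    """Encode a list of chords as a flat one-hot tensor."""
    vectors = []
    for chord in chords:
        vector = torch.zeros(NUM_CHORDS)
        vector[stoi[chord]] = 1.0
        vectors.append(vector)
    return torch.stack(vectors).flatten()


def decode_sequence(flat):
    """Hypothetical inverse: flat tensor -> list of chord symbols."""
    rows = flat.reshape(-1, NUM_CHORDS)  # back to (sequence_length, 10)
    indices = rows.argmax(dim=1)         # position of the 1 in each row
    return [CHORDS[i] for i in indices.tolist()]


flat = encode_sequence(["I", "V", "vi", "IV"])
print(flat.shape)             # torch.Size([40])
print(decode_sequence(flat))  # ['I', 'V', 'vi', 'IV']
```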

### Your Turn

If we encode a sequence of **5 chords**, what will the flattened tensor's
shape be?

In [None]:
my_guess = None  # FILL THIS IN - a single number

five_chord_seq = ["I", "IV", "V", "IV", "I"]
actual_shape = encode_sequence(five_chord_seq).shape[0]
check(my_guess, actual_shape)

---
## Part 11: Decoding Model Output

When a neural network makes a prediction, it doesn't output a perfect one-hot
vector. Instead, it outputs **raw scores** (called "logits") for each category.

These can be any numbers — positive, negative, large, small.

In [None]:
# Pretend this came from a neural network
logits = torch.tensor([0.5, -0.2, 0.1, 0.8, -0.5, 2.1, 0.3, -0.1, 1.5, 0.4])
print(f"Logits: {logits}")
print(f"Shape: {logits.shape}")

To get **probabilities**, we apply the **softmax** function.
It converts raw scores into values between 0 and 1 that sum to 1.
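
For reference, softmax turns each score `z_i` into `exp(z_i) / sum_j exp(z_j)`. The sketch below works through that arithmetic in plain Python on the same example logits, as an illustration of the formula rather than a replacement for `torch.softmax`.

```python
import math

logits = [0.5, -0.2, 0.1, 0.8, -0.5, 2.1, 0.3, -0.1, 1.5, 0.4]

# Step 1: exponentiate every score, which makes all values positive.
exps = [math.exp(z) for z in logits]

# Step 2: divide by the total so the values sum to 1.
total = sum(exps)
probabilities = [e / total for e in exps]

print(round(sum(probabilities), 6))  # 1.0

# The largest logit (2.1, at index 5) receives the largest probability.
print(probabilities.index(max(probabilities)))  # 5
```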

In [None]:
probabilities = torch.softmax(logits, dim=0)
print(f"Probabilities: {probabilities}")
print(f"Sum: {probabilities.sum():.4f}")

To get the model's prediction, we find the **highest probability**:

In [None]:
predicted_index = torch.argmax(probabilities).item()
predicted_chord = CHORDS[predicted_index]
confidence = probabilities[predicted_index].item()

print(f"Predicted index: {predicted_index}")
print(f"Predicted chord: {predicted_chord}")
print(f"Confidence: {confidence:.1%}")

The complete pipeline:

```
Input chords → encode → MODEL → logits → softmax → argmax → decode → Output chord
```
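
The pipeline can be sketched end to end with a stand-in model. Everything about the model here is an assumption made for illustration: it is a single untrained `torch.nn.Linear` layer with random weights, so its prediction is arbitrary. The vocabulary is assumed to be the ten chords listed in Part 6.

```python
import torch

# Assumed vocabulary: the ten chords listed in Part 6, in sorted order.
CHORDS = ["I", "IV", "V", "VI", "VII", "bVII", "i", "ii", "iv", "vi"]
NUM_CHORDS = len(CHORDS)
stoi = {chord: i for i, chord in enumerate(CHORDS)}


def encode_sequence(chords):
    """Encode a list of chords as a flat one-hot tensor."""
    vectors = []
    for chord in chords:
        vector = torch.zeros(NUM_CHORDS)
        vector[stoi[chord]] = 1.0
        vectors.append(vector)
    return torch.stack(vectors).flatten()


# Stand-in model: an UNTRAINED linear layer (hypothetical).
# Input: 4 chords x 10 values = 40. Output: one score per chord = 10.
torch.manual_seed(0)
model = torch.nn.Linear(4 * NUM_CHORDS, NUM_CHORDS)

x = encode_sequence(["I", "IV", "V", "I"])            # encode  -> shape (40,)
with torch.no_grad():
    logits = model(x)                                 # MODEL   -> shape (10,)
probabilities = torch.softmax(logits, dim=0)          # softmax -> sums to 1
predicted_index = torch.argmax(probabilities).item()  # argmax
predicted_chord = CHORDS[predicted_index]             # decode

print(f"Predicted chord: {predicted_chord}")  # arbitrary: the weights are random
```

Training replaces the random weights with learned ones; the steps around the model stay the same.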

In your own words, what are:
- **logits**
- **probabilities**

And why do we often apply **softmax** before choosing the top prediction?


### Double click here to edit

Replace this text with your response.

### Your Turn

Given these logits, what chord will the model predict?

In [None]:
test_logits = torch.tensor([0, 0, 0, 5, 0, 0, 0, 0, 0, 0])

my_guess = ""  # FILL THIS IN - a chord symbol

# Compute the answer
probs = torch.softmax(test_logits, dim=0, dtype=torch.float)
predicted = CHORDS[torch.argmax(probs).item()]
check(my_guess, predicted)

---
## Summary

**Pandas Skills:**
- Load data: `pd.read_csv()`
- Explore: `.shape`, `.columns`, `.head()`, `.tail()`, `.iloc[]`
- Filter: `df[df["col"] == "value"]`, `.str.contains()`
- Transform: `.str.split()`, `.apply()`
- Aggregate: `.value_counts()`, `.mean()`, `.max()`, `.min()`

**Encoding Skills:**
- **Vocabulary**: The set of all possible values (10 chords in our dataset)
- **stoi**: Dictionary mapping symbols to indices
- **One-hot encoding**: Vector of zeros with a single 1 at the item's index
- **Why one-hot**: Treats all categories as equidistant (no false ordering)
- **Sequences**: Stack and flatten multiple one-hot vectors for MLP input
- **Decoding**: Use `argmax` to go from model output back to a symbol

These techniques apply to any categorical data in machine learning!

---
## What's Next?

In the next notebook (`next_chord_prediction.py`), we'll use these encoding
skills to train a neural network that predicts the **next chord** in a
progression.

Given chords like `["I", "IV", "V", "I"]`, the model will learn to predict
what chord comes next!