# Dataset Review and Training Preprocessing

This notebook helps you review the dataset and prepare that data for training.

After completing it, you should understand the basics of:

1. **Pandas** — This Python library is often used in machine learning to load,
   explore, clean, and transform data.

2. **One-Hot Encoding** — a critical technique for converting categorical data
   (like chords) into numbers that neural networks can process.

We'll learn both by exploring a real dataset of chord progressions from
Billboard chart songs. Along the way, you'll complete exercises that require
you to write code and think carefully about what you're seeing.

In [3]:
import torch
import pandas as pd
from collections import Counter


def check(my_answer, correct):
    """Check your answer against the correct one."""
    if isinstance(my_answer, torch.Tensor):
        my_answer = my_answer.tolist()
    if isinstance(correct, torch.Tensor):
        correct = correct.tolist()

    if my_answer == correct:
        print("✓ Correct!")
    else:
        print(f"✗ Not quite. Expected: {correct}")

---
## Part 1: Introduction to Pandas

**Pandas** is a Python library for working with structured data. The core
object is the **DataFrame** — a 2D table with labeled rows and columns.

Think of a DataFrame like a spreadsheet: rows are records, columns are fields.
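
To make the spreadsheet analogy concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The three rows and the name `toy` are made up for illustration; they are not taken from the Billboard file.

```python
import pandas as pd

# Hypothetical three-row table laid out like the chord dataset
toy = pd.DataFrame(
    {
        "title": ["Song A", "Song B", "Song C"],
        "artist": ["Artist X", "Artist Y", "Artist X"],
        "chords": ["I|IV|V|I", "i|VI|VII|i", "I|V|vi|IV"],
    }
)

print(toy.shape)          # (rows, columns)
print(list(toy.columns))  # column labels as a plain list
print(toy.head(2))        # first two records
```

Each dictionary key becomes a column (a field), and each position in the lists becomes a row (a record).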

In [4]:
# https://mtec345.vercel.app/resources/billboard_numerals_simple.csv
df = pd.read_csv("./billboard_numerals_simple.csv")

# Display the first few rows
df.head()

Unnamed: 0,id,title,artist,chart_date,chords
0,19,Here's Some Love,Tanya Tucker,1976-10-09,vi|ii|IV|V|I|IV|V|I|IV|V|IV|V|I|IV|V|I|vi|IV|V...
1,29,The Joker,Steve Miller Band,1973-11-10,I|IV|V|IV|I|IV|V|IV|I|IV|V|IV|I|IV|V|IV|I|IV|V...
2,37,Foggy Mountain Breakdown,Flatt & Scruggs,1968-04-13,I|vi|I|vi|I|V|I|vi|I|vi|I|V|I|vi|I|vi|I|V|I|vi...
3,44,Some Like It Hot,The Power Station,1985-04-06,i|VI|VII|i|VI|VII|i|VI|i|VI|i|VI|i|VI|i|VI|i|V...
4,46,I'll Take You There,The Staple Singers,1972-07-01,I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I...


The `head()` method shows the first 5 rows by default. You can pass a number
to see more or fewer rows:

- `df.head(10)` — first 10 rows
- `df.tail(3)` — last 3 rows

In [5]:
print(f"Shape: {df.shape}")  # (rows, columns)
print(f"Columns: {list(df.columns)}")
print(f"Data types:\n{df.dtypes}")

Shape: (219, 5)
Columns: ['id', 'title', 'artist', 'chart_date', 'chords']
Data types:
id            int64
title           str
artist          str
chart_date      str
chords          str
dtype: object


### Key DataFrame Properties

| Property | Description | Example |
|----------|-------------|---------|
| `df.shape` | Tuple of (rows, columns) | `(219, 5)` |
| `df.columns` | Column names | `['id', 'title', 'artist', 'chart_date', 'chords']` |
| `df.dtypes` | Data type of each column | `str` (or `object` in older pandas) means text/string |
| `len(df)` | Number of rows | `219` |

In [6]:
# Experiment with the properties above to get familiar with the data:
print(f"Shape: {df.shape}")
print(f"Number of songs: {len(df)}")
print(f"Columns: {list(df.columns)}")

Shape: (219, 5)
Number of songs: 219
Columns: ['id', 'title', 'artist', 'chart_date', 'chords']


---
## Part 2: Selecting Data in Pandas

Pandas offers several ways to access data:

| Syntax | What it does | Returns |
|--------|--------------|---------|
| `df["column"]` | Select one column | Series |
| `df[["col1", "col2"]]` | Select multiple columns | DataFrame |
| `df.iloc[0]` | Select row by position (index) | Series |
| `df.iloc[5:10]` | Select rows 5-9 | DataFrame |
| `df.loc[df["col"] == "x"]` | Select rows by condition | DataFrame |
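
The return types in the table can be checked directly. The sketch below uses a small made-up DataFrame (`toy` and its values are illustrative only) and prints the type each selection produces.

```python
import pandas as pd

# Hypothetical data, only for checking return types
toy = pd.DataFrame(
    {
        "title": ["Song A", "Song B", "Song C", "Song D"],
        "artist": ["Artist X", "Artist Y", "Artist X", "Artist Z"],
    }
)

one_column = toy["title"]                             # Series
two_columns = toy[["title", "artist"]]                # DataFrame
first_row = toy.iloc[0]                               # Series (one record)
middle_rows = toy.iloc[1:3]                           # DataFrame, positions 1 and 2
by_condition = toy.loc[toy["artist"] == "Artist X"]   # DataFrame

selections = {
    "one_column": one_column,
    "two_columns": two_columns,
    "first_row": first_row,
    "middle_rows": middle_rows,
    "by_condition": by_condition,
}
for name, value in selections.items():
    print(f"{name}: {type(value).__name__}")
```

Note that the slice `1:3` excludes position 3, just like slicing a Python list.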

In [19]:
titles = df["title"]
print(type(titles))  # pandas Series
titles.head()

<class 'pandas.Series'>


0            Here's Some Love
1                   The Joker
2    Foggy Mountain Breakdown
3            Some Like It Hot
4         I'll Take You There
Name: title, dtype: str

In [20]:
subset = df[["title", "artist"]]
print(type(subset))  # pandas DataFrame
subset.head()

<class 'pandas.DataFrame'>


Unnamed: 0,title,artist
0,Here's Some Love,Tanya Tucker
1,The Joker,Steve Miller Band
2,Foggy Mountain Breakdown,Flatt & Scruggs
3,Some Like It Hot,The Power Station
4,I'll Take You There,The Staple Singers


In [21]:
first_song = df.iloc[0]
print(f"Type: {type(first_song)}")
first_song

Type: <class 'pandas.Series'>


id                                                           19
title                                          Here's Some Love
artist                                             Tanya Tucker
chart_date                                           1976-10-09
chords        vi|ii|IV|V|I|IV|V|I|IV|V|IV|V|I|IV|V|I|vi|IV|V...
Name: 0, dtype: object

In [22]:
df.iloc[10:15]  # Rows 10-14

Unnamed: 0,id,title,artist,chart_date,chords
10,81,You Can Call Me Al,Paul Simon,1986-08-23,I|V|IV|V|I|V|IV|V|I|V|IV|V|I|V|IV|V|I|IV|I|IV|...
11,83,Red Red Wine,UB40,1984-04-07,I|IV|V|IV|I|IV|V|IV|I|IV|V|IV|V|IV|V|I|IV|V|IV...
12,97,Motownphilly,Boyz II Men,1991-08-10,i|VI|V|i|VI|V|i|VI|V|i|VI|V|i|iv|V|VI|VII|V|i|...
13,106,If,Bread,1971-06-05,I|V|IV|iv|I|iv|V|I|V|IV|iv|I|iv|V|I|V|IV|iv|I|...
14,107,Sweet Home Alabama,Lynyrd Skynyrd,1974-10-26,V|IV|I|V|IV|I|V|IV|I|V|IV|I|V|IV|I|V|IV|I|V|IV...


### Your Turn: Explore the Data

Use the selection methods above to browse the dataset. Try things like:

| Syntax | What it does |
|--------|--------------|
| `df.iloc[42]` | Look at a different song |
| `df.iloc[-1]` | Return the last row |
| `df.iloc[-2]` | Return the second-to-last row |
| `df["artist"].head(20)` | See more artists |
| `df.tail(10)` | See the last 10 rows |
| `df.iloc[10:15]` | Rows 10-14 |
| `df["artist"].value_counts().head(10)` | Can you figure this out? |

In [25]:
df.iloc[42:45]

Unnamed: 0,id,title,artist,chart_date,chords
42,249,Old Time Rock & Roll,Bob Seger,1979-05-19,I|V|I|V|I|IV|V|I|IV|V|I|V|I|IV|V|I|V|I|IV|V|I|...
43,250,Handy Man,Jimmy Jones,1960-02-08,I|vi|I|vi|I|vi|IV|V|I|V|I|vi|IV|V|I|IV|I|IV|V|...
44,256,Smokin' In The Boy's Room,Brownsville Station,1973-11-03,IV|I|V|IV|V|IV|I|V|IV|V|IV|I|IV|V|IV|I|IV|ii|V...


In [26]:
# Try different selections to get a feel for the data:
df.iloc[42]

id                                                          249
title                                      Old Time Rock & Roll
artist                                                Bob Seger
chart_date                                           1979-05-19
chords        I|V|I|V|I|IV|V|I|IV|V|I|V|I|IV|V|I|V|I|IV|V|I|...
Name: 42, dtype: object

In [27]:
# Create more cells to explore as needed
df["artist"].value_counts().head(15)

artist
The Rolling Stones     4
John Denver            4
Dion                   3
The Everly Brothers    3
Eric Clapton           3
Ray Charles            3
Bob Seger              3
James Brown            3
Kenny Rogers           3
Billy Squier           3
Steve Miller Band      2
The Staple Singers     2
UB40                   2
Chuck Berry            2
Brenda Lee             2
Name: count, dtype: int64

In [36]:
df["title"].iloc[14:60].value_counts()

title
Sweet Home Alabama                  1
Sunshine Of Your Love               1
Sweet Little Rock And Roll          1
Stand By Me                         1
Silent Night                        1
Blue Eyes Crying In The Rain        1
Brandy (You're A Fine Girl)         1
Tumbling Dice                       1
The Way You Do The Things You Do    1
La Grange                           1
Runaround Sue                       1
Just When I Needed You Most         1
Walk Right Back                     1
Forever Man                         1
Sleep Walk                          1
If You Need Me                      1
Sweet Nothin's                      1
No Charge                           1
Willie And The Hand Jive            1
I Want To Take You Higher           1
Already Gone                        1
She's A Lady                        1
(Night Time Is) The Right Time      1
For Ol' Times Sake                  1
Still Cruisin                       1
Life Is A Carnival                  1
Back H

In [51]:
# OPTIONAL!
# This uses some advanced features that we have not discussed. 
# If you are feeling brave, try to figure out why this works
artist_vc = df["artist"].value_counts()
artists_with_one_hit = artist_vc[artist_vc == 1]
artists_with_one_hit.head(80)

artist
Tanya Tucker                       1
Flatt & Scruggs                    1
The Power Station                  1
The Beatles                        1
J. Frank Wilson & The Cavaliers    1
                                  ..
Otis Redding                       1
U2                                 1
Juice Newton                       1
Rita Coolidge                      1
Badfinger                          1
Name: count, Length: 80, dtype: int64

In [42]:
df["artist"].value_counts()

artist
The Rolling Stones     4
John Denver            4
Dion                   3
The Everly Brothers    3
Eric Clapton           3
                      ..
Sonny & Cher           1
Bobby Darin            1
Shannon                1
The La's               1
The Chiffons           1
Name: count, Length: 167, dtype: int64

In [45]:
artist_vc = df["artist"].value_counts()
artists_with_one_hit = artist_vc[artist_vc == 1]
print(artists_with_one_hit)

artist
Tanya Tucker                       1
Flatt & Scruggs                    1
The Power Station                  1
The Beatles                        1
J. Frank Wilson & The Cavaliers    1
                                  ..
Sonny & Cher                       1
Bobby Darin                        1
Shannon                            1
The La's                           1
The Chiffons                       1
Name: count, Length: 127, dtype: int64


---
## Part 3: Filtering and Aggregating

We can filter rows based on conditions using boolean indexing:

In [53]:
# Find all songs by a specific artist
stones_songs = df[df["artist"] == "The Rolling Stones"]
print(f"Found {len(stones_songs)} Rolling Stones songs")
stones_songs

Found 4 Rolling Stones songs


Unnamed: 0,id,title,artist,chart_date,chords
21,130,Tumbling Dice,The Rolling Stones,1972-06-24,IV|I|V|I|V|I|IV|V|I|V|I|V|I|IV|V|I|V|I|V|I|IV|...
49,278,Waiting On A Friend,The Rolling Stones,1982-01-09,I|ii|IV|I|ii|IV|I|ii|IV|I|ii|IV|I|ii|IV|I|ii|I...
86,457,Not Fade Away,The Rolling Stones,1964-06-13,I|IV|bVII|IV|I|IV|I|IV|bVII|IV|I|IV|bVII|IV|I|...
97,512,Going To A Go-Go,The Rolling Stones,1982-06-12,I|bVII|IV|I|bVII|I|bVII|IV|I|bVII|I|bVII|IV|I|...


In [54]:
# Use .str to access string methods on text columns
michael_songs = df[df["artist"].str.contains("Michael")]
print(f"Found {len(michael_songs)} songs with 'Michael' in artist name")
michael_songs

Found 1 songs with 'Michael' in artist name


Unnamed: 0,id,title,artist,chart_date,chords
119,629,Beat It,Michael Jackson,1983-04-02,i|VII|i|VII|i|VII|i|VII|i|VII|i|VII|VI|VII|i|V...


### Useful String Methods

| Method | Description |
|--------|-------------|
| `df["col"].str.contains("x")` | Check if contains substring |
| `df["col"].str.startswith("x")` | Check if starts with |
| `df["col"].str.lower()` | Convert to lowercase |
| `df["col"].str.split(",")` | Split by delimiter |
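
Here is a small sketch of these string methods used as boolean masks. Apart from the one artist name taken from the output above, the rows are made up, and the missing value is included on purpose: passing `na=False` makes a missing entry count as "no match" rather than producing a missing value in the mask.

```python
import pandas as pd

# Hypothetical rows; None stands in for a missing artist
toy = pd.DataFrame(
    {
        "title": ["Song A", "Song B", "Song C", "Song D"],
        "artist": ["Michael Jackson", "Ann Michaels", None, "The Examples"],
    }
)

# Case-insensitive substring match; missing values are treated as False
has_michael = toy["artist"].str.contains("michael", case=False, na=False)
print(toy[has_michael])

# Prefix match, again treating missing values as False
starts_with_the = toy["artist"].str.startswith("The", na=False)
print(toy[starts_with_the])

# Element-wise lowercase; the missing value stays missing
lowered = toy["artist"].str.lower()
print(lowered)
```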

---
## Part 4: Working with the Chords Column

Our chords column contains strings like `"I|IV|V|I"` — chord symbols
separated by `|`. We need to **split** these strings into lists.

In [56]:
chords_before_split = df["chords"].iloc[0]
print("\nBefore splitting:", chords_before_split[0:36], "...\n")

chords_after_split = chords_before_split.split("|")
print("\nAfter splitting:", chords_after_split[:8], "...\n")  # show the beginning


Before splitting: vi|ii|IV|V|I|IV|V|I|IV|V|IV|V|I|IV|V ...


After splitting: ['vi', 'ii', 'IV', 'V', 'I', 'IV', 'V', 'I'] ...



### Your Turn: What Changed?

In a **markdown cell**, explain what the difference is between:

- The value **before** splitting
- The value **after** splitting

Your answer should mention:
- The **data type** (use the `type(chords_before_split)` function if you are unsure)
- What is the advantage of the new format?

Write your answer in the next cell.


### Double click here to edit

Before splitting, the value from the chords column is a single str. After splitting that str, the data type becomes a list that contains multiple str values.

In [57]:
# Split each chord string into a list, without assigning the result back yet.

chords_split = df["chords"].str.split("|")  # a new Series; df is unchanged

print('\ndf["chords"].head():')
print(df["chords"].head())


df["chords"].head():
0    vi|ii|IV|V|I|IV|V|I|IV|V|IV|V|I|IV|V|I|vi|IV|V...
1    I|IV|V|IV|I|IV|V|IV|I|IV|V|IV|I|IV|V|IV|I|IV|V...
2    I|vi|I|vi|I|V|I|vi|I|vi|I|V|I|vi|I|vi|I|V|I|vi...
3    i|VI|VII|i|VI|VII|i|VI|i|VI|i|VI|i|VI|i|VI|i|V...
4    I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I...
Name: chords, dtype: str


The output of the previous cell shows `dtype: str`. Interesting!

That means that `df["chords"]` still contains strings...

Do you understand why calling `df["chords"].str.split("|")` did not change the values in `df["chords"]` to lists?
 
Before reading the explanation below, write your best guess:
**Why didn't `df["chords"].str.split("|")` change the values inside `df["chords"]`?**


### Double click here to edit
After the split, pandas only produces a new version of the strings, but the result hasn't been stored in a variable or put back into the DataFrame?


### Separating Two Ideas

**Idea 1: Splitting**

- `df["chords"].str.split("|")` computes a *new* Series where each string
  becomes a list. It does not mutate the existing DataFrame.

**Idea 2: Assigning back into the DataFrame**

- `df["chords"] = ...` stores the result into the DataFrame column.
- This ability to use `=` to edit a column is part of what makes pandas useful.

We'll overwrite `df["chords"]` because the rest of the lab expects each row
to contain a list of chord symbols (not one long string).
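
The two ideas can be seen separately in a minimal sketch. The two-row DataFrame `toy` is made up for illustration; the calls are the same ones used in this lab.

```python
import pandas as pd

toy = pd.DataFrame({"chords": ["I|IV|V", "i|VI"]})  # hypothetical rows

# Idea 1: splitting returns a new Series and leaves toy untouched
split_result = toy["chords"].str.split("|")
type_before = type(toy["chords"].iloc[0])
print(f"Before assignment: {type_before}")

# Idea 2: assigning stores the new Series in the column
toy["chords"] = split_result
type_after = type(toy["chords"].iloc[0])
print(f"After assignment:  {type_after}")
print(toy["chords"].iloc[0])
```

Until the assignment runs, the lists exist only in `split_result`.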

In [58]:
chords_split = df["chords"].str.split("|")  # compute a new Series of lists
df["chords"] = chords_split                 # store the result in the column

print("After assigning back into df['chords']:")
print(df["chords"].head())
print(f"\nType of first value: {type(df['chords'].iloc[0])}")

After assigning back into df['chords']:
0    [vi, ii, IV, V, I, IV, V, I, IV, V, IV, V, I, ...
1    [I, IV, V, IV, I, IV, V, IV, I, IV, V, IV, I, ...
2    [I, vi, I, vi, I, V, I, vi, I, vi, I, V, I, vi...
3    [i, VI, VII, i, VI, VII, i, VI, i, VI, i, VI, ...
4    [I, IV, I, IV, I, IV, I, IV, I, IV, I, IV, I, ...
Name: chords, dtype: object

Type of first value: <class 'list'>


Now each cell contains a **list** of chord symbols instead of a single string.

The chords are written in **Roman numeral notation**, which describes chords
relative to the key of the song:

- **I** = the "one" chord (tonic major)
- **IV** = the "four" chord (subdominant)
- **V** = the "five" chord (dominant)
- **vi** = the "six" chord (relative minor)

This notation is key-independent.

**What are the advantages and disadvantages of this approach for machine learning?** 

Make a guess and put it in the next cell.


### Double click here to edit

This approach makes machine learning easier because it ignores the absolute key, which simplifies the problem. The disadvantage is that the key sometimes matters for the genre, and different singers have different ranges, so it might affect how the song turns out.

In [59]:
sample = df.iloc[0]
print(f"Title: {sample['title']}")
print(f"Artist: {sample['artist']}")
print(f"Number of chords: {len(sample['chords'])}")
print(f"First 8 chords: {sample['chords'][:8]}")

Title: Here's Some Love
Artist: Tanya Tucker
Number of chords: 66
First 8 chords: ['vi', 'ii', 'IV', 'V', 'I', 'IV', 'V', 'I']


### Your Turn: Exploring Chord Progressions

Let's practice accessing the chord data step by step.

In [60]:
# First, select a song using iloc
song = df.iloc[100]
print(f"Song: '{song['title']}' by {song['artist']}")

Song: 'Running Up That Hill' by Kate Bush


In [61]:
# The "chords" field is now a Python list
chord_list = song["chords"]
print(f"Type: {type(chord_list)}")
print(f"First few chords: {chord_list[:8]}")

Type: <class 'list'>
First few chords: ['i', 'VI', 'VII', 'i', 'VI', 'VII', 'i', 'VI']


In [62]:
# Remember: len() gives the length of any list
num_chords = len(chord_list)
print(f"This song has {num_chords} chords")

This song has 111 chords


In [None]:
# Use list indexing: [0] for first, [-1] for last
print(f"First chord: {chord_list[0]}")
print(f"Last chord: {chord_list[-1]}")
print(f"Chords at indices 4-7: {chord_list[4:8]}")

### Explore on Your Own

Pick a few songs and examine their chord progressions:
- Which song has the most chords?
- Do most songs start on the same chord?
- Do songs tend to end on the "I" chord?

In [105]:
# .copy() gives an independent DataFrame, so adding columns below is safe
df = df.dropna(subset=["chords"]).copy()

# Accept either raw "I|IV|V" strings or already-split lists
df["chord_list"] = df["chords"].apply(
    lambda c: c.split("|") if isinstance(c, str) else c
)
df["chord_count"] = df["chord_list"].apply(len)

# Which song has the most chords?
most_chords_index = df["chord_count"].idxmax()
song_with_most = df.loc[most_chords_index]

print(f"Most chord Title:  {song_with_most['title']}")
print(f"Artist:        {song_with_most['artist']}")
print(f"Chord Count:   {song_with_most['chord_count']}")


def first_chord(chord_list):
    return chord_list[0]


def last_chord(chord_list):
    return chord_list[-1]


# Do most songs start on the same chord?
df["first_chord"] = df["chord_list"].apply(first_chord)
print(df[["title", "first_chord"]].head(50))
print("yes")

# Do songs tend to end on the "I" chord?
df["last_chord"] = df["chord_list"].apply(last_chord)
print(df[["title", "last_chord"]].head(50))
print("yes")



Most chord Title:  Got It Made
Artist:        Crosby, Stills, Nash
Chord Count:   242
                               title first_chord
0                   Here's Some Love          vi
1                          The Joker           I
2           Foggy Mountain Breakdown           I
3                   Some Like It Hot           i
4                I'll Take You There           I
5                         Love Me Do           I
6                          Last Kiss           I
7                        Smoking Gun           i
8                     Do You Love Me           I
9                       Get Together           I
10                You Can Call Me Al           I
11                      Red Red Wine           I
12                      Motownphilly           i
13                                If           I
14                Sweet Home Alabama           V
15             Sunshine Of Your Love           I
16        Sweet Little Rock And Roll           I
17                       Stand B

In [106]:
# Try looking at different songs:
df.iloc[50]["chords"][:10]

'I|IV|V|I|I'

---
## Part 5: Computing New Columns

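We can create new columns by applying functions to existing ones.
Let's add a column for the number of chords in each song.

This only counts chords when each row of `df["chords"]` is a list, so the
Part 4 assignment must have run first. On unsplit strings, `len` counts
characters, which is what the recorded outputs below show (159 for a song
that has 66 chords).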

In [114]:
# .apply(len) runs len() on each row's chord list
df["num_chords"] = df["chords"].apply(len)
df[["title", "artist", "num_chords"]].head(10)

Unnamed: 0,title,artist,num_chords
0,Here's Some Love,Tanya Tucker,159
1,The Joker,Steve Miller Band,353
2,Foggy Mountain Breakdown,Flatt & Scruggs,183
3,Some Like It Hot,The Power Station,175
4,I'll Take You There,The Staple Singers,194
5,Love Me Do,The Beatles,148
6,Last Kiss,J. Frank Wilson & The Cavaliers,179
7,Smoking Gun,The Robert Cray Band,51
8,Do You Love Me,The Contours,271
9,Get Together,The Youngbloods,203


In [115]:
# Pandas makes it easy to compute summary statistics
print(f"Average chords per song: {df['num_chords'].mean():.1f}")
print(f"Minimum: {df['num_chords'].min()}")
print(f"Maximum: {df['num_chords'].max()}")
print(f"Median: {df['num_chords'].median()}")

Average chords per song: 186.3
Minimum: 1
Maximum: 639
Median: 162.0


In [116]:
# idxmax() / idxmin() return the index label of the row with the max / min value
longest_song = df.loc[df["num_chords"].idxmax()]
print(f"Longest: '{longest_song['title']}' by {longest_song['artist']} ({longest_song['num_chords']} chords)")

shortest_song = df.loc[df["num_chords"].idxmin()]
print(f"Shortest: '{shortest_song['title']}' by {shortest_song['artist']} ({shortest_song['num_chords']} chords)")

Longest: 'Got It Made' by Crosby, Stills, Nash (639 chords)
Shortest: 'La Grange' by ZZ Top (1 chords)


In [117]:
# Which songs have more than 200 chords?
long_songs = df[df["num_chords"] > 200]
print(f"Songs with 200+ chords: {len(long_songs)}")
long_songs[["title", "artist", "num_chords"]]

Songs with 200+ chords: 73


Unnamed: 0,title,artist,num_chords
1,The Joker,Steve Miller Band,353
8,Do You Love Me,The Contours,271
9,Get Together,The Youngbloods,203
10,You Can Call Me Al,Paul Simon,500
11,Red Red Wine,UB40,421
...,...,...,...
208,Lucky Man,"Emerson, Lake & Palmer",207
210,Island Of Lost Souls,Blondie,421
211,Baby Don't Go,Sonny & Cher,336
215,Let The Music Play,Shannon,440


---
## Part 6: Building the Vocabulary

Before we can encode chords for a neural network, we need to know what
chords exist in our dataset. This set of all possible values is called
the **vocabulary**.

In [118]:
all_chords = []
for chord_list in df["chords"]:
    if chord_list:  # Skip empty lists
        all_chords.extend(chord_list)

print(f"Total chord occurrences: {len(all_chords)}")

Total chord occurrences: 40235


In [119]:
unique_chords = sorted(set(all_chords))
print(f"Found {len(unique_chords)} unique chords:")
print(unique_chords)

Found 6 unique chords:
['I', 'V', 'b', 'i', 'v', '|']


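Together, the unique chords form our vocabulary: the full set of categories.
When `df["chords"]` holds lists, that set is the 10 symbols listed below; the
single characters in the recorded output above come from unsplit strings.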

Notice the mix of major (uppercase) and minor (lowercase):
- Major: I, IV, V, VI, VII, bVII
- Minor: i, ii, iv, vi
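
One thing worth checking when building a vocabulary with `extend` is whether each row is a list or still a string, because `list.extend` accepts both. The sketch below uses made-up rows; `rows_as_lists`, `rows_as_strings`, and `build_vocab` are illustrative names, not part of the lab.

```python
# Hypothetical rows in the two possible formats
rows_as_lists = [["I", "IV", "V", "I"], ["i", "bVII", "i"]]
rows_as_strings = ["I|IV|V|I", "i|bVII|i"]


def build_vocab(rows):
    """Collect every token from every row and return the sorted unique set."""
    tokens = []
    for row in rows:
        # extend() iterates over its argument: list items, or characters of a str
        tokens.extend(row)
    return sorted(set(tokens))


vocab_from_lists = build_vocab(rows_as_lists)
vocab_from_strings = build_vocab(rows_as_strings)

print(vocab_from_lists)    # ['I', 'IV', 'V', 'bVII', 'i']
print(vocab_from_strings)  # ['I', 'V', 'b', 'i', '|']
```

If the vocabulary contains `|` or lone letters like `b`, the rows were most likely still strings when the loop ran.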


To train our first neural network, we are going to use a small vocabulary.

The original dataset includes chords with inversions and much more complicated
variations. You may want to use these more complex data in the future, but for
now we have simplified 7th chords and omitted many songs from the dataset that
have more complicated chords.
**That is why our dataset includes only 219 songs (the row count reported by `df.shape` above), while the original contains over 700.**

---
## Part 7: Chord Distribution

How often does each chord appear? This tells us about common patterns in
popular music.

In [113]:
chord_counts = Counter(all_chords)

print("Chord distribution (most common first):\n")
for chord, count in chord_counts.most_common():
    bar = "█" * (count // 500)  # Simple text bar chart
    print(f"  {chord:5s}: {count:5d} {bar}")

Chord distribution (most common first):

  |    : 15651 ███████████████████████████████
  I    : 11391 ██████████████████████
  V    :  8621 █████████████████
  i    :  3218 ██████
  v    :   902 █
  b    :   452 


**What does this tell us?**

- **I** dominates — most songs emphasize the tonic chord
- **IV** and **V** are the next most common (the classic I-IV-V progression)
- Minor chords (i, vi, ii, iv) appear less frequently
- **bVII** and **VII** are relatively rare

This imbalance will affect our neural network — it will learn more about
common chords than rare ones.
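
One way to put a number on the imbalance is to ask how often a "model" would be right if it always guessed the most common chord. The sketch below does this on a short made-up list (`toy_chords` is hypothetical, not the real data).

```python
from collections import Counter

# Hypothetical chord occurrences
toy_chords = ["I", "IV", "V", "I", "vi", "IV", "I", "V", "I", "I"]

counts = Counter(toy_chords)
total = sum(counts.values())

# Share of all occurrences taken by each chord, most common first
shares = {chord: n / total for chord, n in counts.most_common()}
for chord, share in shares.items():
    print(f"{chord:3s}: {share:.0%}")

# Always guessing the most common chord is right this often
majority_chord, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total
print(f"Always guessing '{majority_chord}' is right {baseline_accuracy:.0%} of the time")
```

A trained model should beat this baseline; if it does not, it may simply have learned the imbalance.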

In 2–4 sentences:
- What chord(s) seem most common?
- How might this imbalance affect a model trained to predict the next chord?
  


### Double click here to edit
The I chord.
How often a chord appears might not be enough for an accurate prediction, because the best result depends not only on frequency but also on the chord structure and how the progression resolves. If we followed frequency alone, we might end up with a song that has only one chord.

In [120]:
# What fraction of all chords is just "I"?
i_percentage = chord_counts["I"] / len(all_chords) * 100
print(f"The 'I' chord makes up {i_percentage:.1f}% of all chord occurrences!")

The 'I' chord makes up 28.3% of all chord occurrences!


---
## Part 8: Why One-Hot Encoding?

Now we need to convert these chord symbols into numbers for our neural network.

**The naive approach**: assign each chord an integer.

In [121]:
naive_encoding = {chord: i for i, chord in enumerate(unique_chords)}
print("Naive integer encoding:")
for chord, idx in naive_encoding.items():
    print(f"  {chord} → {idx}")

Naive integer encoding:
  I → 0
  V → 1
  b → 2
  i → 3
  v → 4
  | → 5


**Why is this problematic?**

If we feed these integers directly into a linear layer, the model might learn
that:
- "I" (0) and "IV" (1) are "close" because 0 and 1 are close numbers
- "I" (0) and "vi" (9) are "far apart" because 0 and 9 are distant

But musically, this makes no sense! The relationship between chords isn't
captured by arbitrary index numbers.

**One-hot encoding** solves this by treating all categories as **equidistant**.
Each chord becomes a vector of zeros with a single 1 at its position.
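
The "equidistant" claim can be checked with a little arithmetic. This is a minimal pure-Python sketch using a hypothetical 4-chord vocabulary (`toy_vocab`) rather than the lab's full one; it compares the gap between integer codes with the Euclidean distance between one-hot vectors.

```python
import math

toy_vocab = ["I", "IV", "V", "vi"]  # hypothetical small vocabulary
index = {chord: i for i, chord in enumerate(toy_vocab)}


def one_hot(chord):
    """Return a list of zeros with a single 1.0 at the chord's index."""
    return [1.0 if i == index[chord] else 0.0 for i in range(len(toy_vocab))]


# Integer codes: the gap depends on the arbitrary ordering
int_gap_near = abs(index["I"] - index["IV"])
int_gap_far = abs(index["I"] - index["vi"])
print(f"Integer gaps: {int_gap_near} vs {int_gap_far}")

# One-hot vectors: every pair of different chords is the same distance apart
onehot_gap_near = math.dist(one_hot("I"), one_hot("IV"))
onehot_gap_far = math.dist(one_hot("I"), one_hot("vi"))
print(f"One-hot distances: {onehot_gap_near:.3f} vs {onehot_gap_far:.3f}")
```

Any two distinct one-hot vectors differ in exactly two positions, so their distance is always the square root of 2.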

### Think About It

In the naive encoding, "IV" is encoded as 1 and "V" as 2. A neural network
might learn these are "close" because 1 and 2 are close numbers.

Is this good or bad? Actually, IV and V *are* musically related — they're
both major chords commonly used together. So sometimes the naive encoding
creates useful relationships **by accident**.

The problem is that it also creates *meaningless* relationships. Is "I" (0)
really maximally different from "vi" (9)? Not musically!

**One-hot encoding** gives us a clean slate. All chords start equidistant,
and the network learns *actual* relationships from the data.

In your own words:
- Why can integer (naive) encoding be misleading for chord categories?
- What does one-hot encoding fix?


### Double click here to edit

Replace this text with your response.

---
## Part 9: Build Encoder/Decoder Functions

Let's build functions to convert between chord symbols and one-hot vectors.

In [None]:
CHORDS = sorted(set(all_chords))  # Our vocabulary
NUM_CHORDS = len(CHORDS)

# Use a python "dictionary comprehension" to construct a mapping dict

stoi = {chord: i for i, chord in enumerate(CHORDS)}

# `stoi` is a python dictionary that maps string=>index

print(f"Vocabulary size: {NUM_CHORDS}\n")
stoi

In [None]:
def encode(chord):
    """One-hot encode a chord symbol."""
    index = stoi[chord]
    vector = torch.zeros(NUM_CHORDS)
    vector[index] = 1.0
    return vector


# Test it
print('encode("I") =', encode("I"))
print('encode("V") =', encode("V"))
print('encode("vi") =', encode("vi"))

In [None]:
def decode(vector):
    """Decode a one-hot vector back to a chord symbol."""
    index = torch.argmax(vector).item()
    return CHORDS[index]


# Test round-trip
original = "IV"
encoded = encode(original)
decoded = decode(encoded)
print(f"Original: {original}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")

### Your Turn

Predict what `encode("ii")` will produce:

In [None]:
my_guess = []  # FILL THIS IN - a list of 10 values (0s and 1s)

check(my_guess, encode("ii").tolist())

Now predict what `decode(torch.tensor([0, 0, 0, 0, 0, 0, 0, 1, 0, 0]))` returns:

In [None]:
my_guess = ""  # FILL THIS IN - a chord symbol like "I" or "vi"

check(my_guess, decode(torch.tensor([0, 0, 0, 0, 0, 0, 0, 1, 0, 0])))

---
## Part 10: Encode a Sequence

Neural networks often need to process **sequences** of chords, not just
single chords. Let's encode a chord progression.

In [None]:
progression = ["I", "IV", "V", "I"]

# Encode each chord
encoded_chords = [encode(c) for c in progression]

# Look at the individual encodings
for chord, vec in zip(progression, encoded_chords):
    print(f"{chord}: {vec}")

In [None]:
stacked = torch.stack(encoded_chords)
print(f"\nStacked shape: {stacked.shape}")
print(stacked)

This is a `(4, 10)` tensor — 4 chords, each with 10 values.

For a multi-layer perceptron (MLP), we usually want a **flat** 1D input.
We can flatten the matrix:

In [None]:
flattened = stacked.flatten()
print(f"Flattened shape: {flattened.shape}")
print(flattened)

Now our 4-chord progression is a single vector of **40 values** (4 × 10).

This is exactly what we'll feed into our neural network!

Explain (briefly) what `.flatten()` did here.

Your answer should mention:
- How the **shape** changed
- Why we might prefer a flat 1D vector for an MLP


### Double click here to edit

Replace this text with your response.

In [None]:
def encode_sequence(chords):
    """Encode a list of chords as a flat tensor."""
    encoded = [encode(c) for c in chords]
    return torch.stack(encoded).flatten()


# Test it
test_seq = ["I", "V", "vi", "IV"]
encoded = encode_sequence(test_seq)
print(f"Sequence: {test_seq}")
print(f"Encoded shape: {encoded.shape}")

### Your Turn

If we encode a sequence of **5 chords**, what will the flattened tensor's
shape be?

In [None]:
my_guess = None  # FILL THIS IN - a single number

five_chord_seq = ["I", "IV", "V", "IV", "I"]
actual_shape = encode_sequence(five_chord_seq).shape[0]
check(my_guess, actual_shape)

---
## Part 11: Decoding Model Output

When a neural network makes a prediction, it doesn't output a perfect one-hot
vector. Instead, it outputs **raw scores** (called "logits") for each category.

These can be any numbers — positive, negative, large, small.

In [None]:
# Pretend this came from a neural network
logits = torch.tensor([0.5, -0.2, 0.1, 0.8, -0.5, 2.1, 0.3, -0.1, 1.5, 0.4])
print(f"Logits: {logits}")
print(f"Shape: {logits.shape}")

To get **probabilities**, we apply the **softmax** function.
It converts raw scores into values between 0 and 1 that sum to 1.
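
To see what softmax computes, here is a hand-written sketch in plain Python: each score is exponentiated and then divided by the sum of all the exponentials. The function name `softmax_by_hand` and the three example scores are made up; in the lab itself we use `torch.softmax`.

```python
import math


def softmax_by_hand(scores):
    """Exponentiate each score and normalize so the results sum to 1."""
    # Subtracting the max first avoids overflow and does not change the result
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


example_scores = [2.0, 1.0, 0.1]  # hypothetical logits
example_probs = softmax_by_hand(example_scores)

print([round(p, 3) for p in example_probs])
print(f"Sum: {sum(example_probs):.4f}")
```

The ordering of the scores is preserved: the largest logit always gets the largest probability.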

In [None]:
probabilities = torch.softmax(logits, dim=0)
print(f"Probabilities: {probabilities}")
print(f"Sum: {probabilities.sum():.4f}")

To get the model's prediction, we find the **highest probability**:

In [None]:
predicted_index = torch.argmax(probabilities).item()
predicted_chord = CHORDS[predicted_index]
confidence = probabilities[predicted_index].item()

print(f"Predicted index: {predicted_index}")
print(f"Predicted chord: {predicted_chord}")
print(f"Confidence: {confidence:.1%}")

The complete pipeline:

```
Input chords → encode → MODEL → logits → softmax → argmax → decode → Output chord
```
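
The pipeline can be sketched end to end with a stand-in model. Everything below is an assumption for illustration: the vocabulary is written out by hand in sorted order, and the "model" is a single untrained `torch.nn.Linear` layer, so its prediction is meaningless. The point is only to show the shapes at each step.

```python
import torch

# Assumed 10-symbol vocabulary, written out in sorted order
CHORDS = ["I", "IV", "V", "VI", "VII", "bVII", "i", "ii", "iv", "vi"]
stoi = {chord: i for i, chord in enumerate(CHORDS)}


def encode_sequence(chords):
    """One-hot encode each chord, then flatten into a single 1D tensor."""
    vectors = []
    for chord in chords:
        vector = torch.zeros(len(CHORDS))
        vector[stoi[chord]] = 1.0
        vectors.append(vector)
    return torch.stack(vectors).flatten()


torch.manual_seed(0)
# Untrained stand-in for a real model: 4 chords in, one score per chord out
model = torch.nn.Linear(4 * len(CHORDS), len(CHORDS))

x = encode_sequence(["I", "IV", "V", "I"])   # shape (40,)
with torch.no_grad():
    logits = model(x)                        # shape (10,)
probs = torch.softmax(logits, dim=0)         # sums to 1
predicted = CHORDS[torch.argmax(probs).item()]

print(f"Input shape: {x.shape}, logits shape: {logits.shape}")
print(f"Predicted chord (untrained, so arbitrary): {predicted}")
```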

In your own words, what are:
- **logits**
- **probabilities**

And why do we often apply **softmax** before choosing the top prediction?


### Double click here to edit

Replace this text with your response.

### Your Turn

Given these logits, what chord will the model predict?

In [None]:
test_logits = torch.tensor([0, 0, 0, 5, 0, 0, 0, 0, 0, 0])

my_guess = ""  # FILL THIS IN - a chord symbol

# Compute the answer
probs = torch.softmax(test_logits, dim=0, dtype=torch.float)
predicted = CHORDS[torch.argmax(probs).item()]
check(my_guess, predicted)

---
## Summary

**Pandas Skills:**
- Load data: `pd.read_csv()`
- Explore: `.shape`, `.columns`, `.head()`, `.tail()`, `.iloc[]`
- Filter: `df[df["col"] == "value"]`, `.str.contains()`
- Transform: `.str.split()`, `.apply()`
- Aggregate: `.value_counts()`, `.mean()`, `.max()`, `.min()`

**Encoding Skills:**
- **Vocabulary**: The set of all possible values (10 chords in our dataset)
- **stoi**: Dictionary mapping symbols to indices
- **One-hot encoding**: Vector of zeros with a single 1 at the item's index
- **Why one-hot**: Treats all categories as equidistant (no false ordering)
- **Sequences**: Stack and flatten multiple one-hot vectors for MLP input
- **Decoding**: Use `argmax` to go from model output back to a symbol

These techniques apply to any categorical data in machine learning!

---
## What's Next?

In the next notebook (`next_chord_prediction.py`), we'll use these encoding
skills to train a neural network that predicts the **next chord** in a
progression.

Given chords like `["I", "IV", "V", "I"]`, the model will learn to predict
what chord comes next!