# Dataset Review and Training Preprocessing

This notebook helps you review the dataset and prepare that data for training.

After completing you should understand the basics of:

1. **Pandas** — This Python library is often used in machine learning to load,
   explore, clean, and transform data.

2. **One-Hot Encoding** — a critical technique for converting categorical data
   (like chords) into numbers that neural networks can process.

We'll learn both by exploring a real dataset of chord progressions from
Billboard chart songs. Along the way, you'll complete exercises that require
you to write code and think carefully about what you're seeing.

In [1]:
import torch
import pandas as pd
from collections import Counter


def check(my_answer, correct):
    """Check your answer against the correct one."""
    if isinstance(my_answer, torch.Tensor):
        my_answer = my_answer.tolist()
    if isinstance(correct, torch.Tensor):
        correct = correct.tolist()

    if my_answer == correct:
        print("✓ Correct!")
    else:
        print(f"✗ Not quite. Expected: {correct}")

---
## Part 1: Introduction to Pandas

**Pandas** is a Python library for working with structured data. The core
object is the **DataFrame** — a 2D table with labeled rows and columns.

Think of a DataFrame like a spreadsheet: rows are records, columns are fields.

In [49]:
# https://mtec345.vercel.app/resources/billboard_numerals_simple.csv
df = df =df = pd.read_csv("https://mtec345.vercel.app/resources/billboard_numerals_simple.csv")

# Display the first few rows
df.head()

Unnamed: 0,id,title,artist,chart_date,chords
0,19,Here's Some Love,Tanya Tucker,1976-10-09,vi|ii|IV|V|I|IV|V|I|IV|V|IV|V|I|IV|V|I|vi|IV|V...
1,29,The Joker,Steve Miller Band,1973-11-10,I|IV|V|IV|I|IV|V|IV|I|IV|V|IV|I|IV|V|IV|I|IV|V...
2,37,Foggy Mountain Breakdown,Flatt & Scruggs,1968-04-13,I|vi|I|vi|I|V|I|vi|I|vi|I|V|I|vi|I|vi|I|V|I|vi...
3,44,Some Like It Hot,The Power Station,1985-04-06,i|VI|VII|i|VI|VII|i|VI|i|VI|i|VI|i|VI|i|VI|i|V...
4,46,I'll Take You There,The Staple Singers,1972-07-01,I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I...


The `head()` method shows the first 5 rows by default. You can pass a number
to see more or fewer rows:

- `df.head(10)` — first 10 rows
- `df.tail(3)` — last 3 rows

In [6]:
print(f"Shape: {df.shape}")  # (rows, columns)
print(f"Columns: {list(df.columns)}")
print(f"Data types:\n{df.dtypes}")

Shape: (216, 5)
Columns: ['id', 'title', 'artist', 'chart_date', 'chords']
Data types:
id            int64
title           str
artist          str
chart_date      str
chords          str
dtype: object


### Key DataFrame Properties

| Property | Description | Example |
|----------|-------------|---------|
| `df.shape` | Tuple of (rows, columns) | `(220, 3)` |
| `df.columns` | List of column names | `['title', 'artist', 'chords']` |
| `df.dtypes` | Data type of each column | `object` means text/string |
| `len(df)` | Number of rows | `220` |

In [7]:
# Experiment with the properties above to get familiar with the data:
print(f"Shape: {df.shape}")
print(f"Number of songs: {len(df)}")
print(f"Columns: {list(df.columns)}")

Shape: (216, 5)
Number of songs: 216
Columns: ['id', 'title', 'artist', 'chart_date', 'chords']


---
## Part 2: Selecting Data in Pandas

Pandas offers several ways to access data:

| Syntax | What it does | Returns |
|--------|--------------|---------|
| `df["column"]` | Select one column | Series |
| `df[["col1", "col2"]]` | Select multiple columns | DataFrame |
| `df.iloc[0]` | Select row by position (index) | Series |
| `df.iloc[5:10]` | Select rows 5-9 | DataFrame |
| `df.loc[df["col"] == "x"]` | Select rows by condition | DataFrame |

In [8]:
titles = df["title"]
print(type(titles))  # pandas Series
titles.head()

<class 'pandas.Series'>


0            Here's Some Love
1                   The Joker
2    Foggy Mountain Breakdown
3            Some Like It Hot
4         I'll Take You There
Name: title, dtype: str

In [9]:
subset = df[["title", "artist"]]
print(type(subset))  # pandas DataFrame
subset.head()

<class 'pandas.DataFrame'>


Unnamed: 0,title,artist
0,Here's Some Love,Tanya Tucker
1,The Joker,Steve Miller Band
2,Foggy Mountain Breakdown,Flatt & Scruggs
3,Some Like It Hot,The Power Station
4,I'll Take You There,The Staple Singers


In [10]:
first_song = df.iloc[0]
print(f"Type: {type(first_song)}")
first_song

Type: <class 'pandas.Series'>


id                                                           19
title                                          Here's Some Love
artist                                             Tanya Tucker
chart_date                                           1976-10-09
chords        vi|ii|IV|V|I|IV|V|I|IV|V|IV|V|I|IV|V|I|vi|IV|V...
Name: 0, dtype: object

In [11]:
df.iloc[10:15]  # Rows 10-14

Unnamed: 0,id,title,artist,chart_date,chords
10,81,You Can Call Me Al,Paul Simon,1986-08-23,I|V|IV|V|I|V|IV|V|I|V|IV|V|I|V|IV|V|I|IV|I|IV|...
11,83,Red Red Wine,UB40,1984-04-07,I|IV|V|IV|I|IV|V|IV|I|IV|V|IV|V|IV|V|I|IV|V|IV...
12,97,Motownphilly,Boyz II Men,1991-08-10,i|VI|V|i|VI|V|i|VI|V|i|VI|V|i|iv|V|VI|VII|V|i|...
13,106,If,Bread,1971-06-05,I|V|IV|iv|I|iv|V|I|V|IV|iv|I|iv|V|I|V|IV|iv|I|...
14,107,Sweet Home Alabama,Lynyrd Skynyrd,1974-10-26,V|IV|I|V|IV|I|V|IV|I|V|IV|I|V|IV|I|V|IV|I|V|IV...


In [17]:
df.iloc[10:16]  # Rows 10-15

Unnamed: 0,id,title,artist,chart_date,chords
10,81,You Can Call Me Al,Paul Simon,1986-08-23,I|V|IV|V|I|V|IV|V|I|V|IV|V|I|V|IV|V|I|IV|I|IV|...
11,83,Red Red Wine,UB40,1984-04-07,I|IV|V|IV|I|IV|V|IV|I|IV|V|IV|V|IV|V|I|IV|V|IV...
12,97,Motownphilly,Boyz II Men,1991-08-10,i|VI|V|i|VI|V|i|VI|V|i|VI|V|i|iv|V|VI|VII|V|i|...
13,106,If,Bread,1971-06-05,I|V|IV|iv|I|iv|V|I|V|IV|iv|I|iv|V|I|V|IV|iv|I|...
14,107,Sweet Home Alabama,Lynyrd Skynyrd,1974-10-26,V|IV|I|V|IV|I|V|IV|I|V|IV|I|V|IV|I|V|IV|I|V|IV...
15,114,Sunshine Of Your Love,Cream,1968-03-30,I|IV|I|V|bVII|IV|V|bVII|IV|V|bVII|IV|V|I|IV|I|...


### Your Turn: Explore the Data

Use the selection methods above to browse the dataset. Try things like:
| Syntax                  | What it does |
|-------------------------|-----------------|
| `df.iloc[42]`           | look at different songs |
| `df.iloc[-1]`           | return the last row |
| `df.iloc[-2]`           | return the second-to-last row |
| `df["artist"].head(20)` | see more artists |
| `df.tail(10)`           | see the last 10 rows |
| `df.iloc[10:15]`        |  Rows 10-14 |
| `df["artist"].value_counts().head(10)`        |  Can you figure this out? |

In [12]:
# Try different selections to get a feel for the data:
df.iloc[-1]

id                                                         1292
title                                              He's So Fine
artist                                             The Chiffons
chart_date                                           1963-03-09
chords        ii|V|ii|V|ii|V|ii|V|ii|V|ii|V|I|ii|V|ii|V|ii|V...
Name: 215, dtype: object

In [16]:
df.iloc[13]

id                                                          106
title                                                        If
artist                                                    Bread
chart_date                                           1971-06-05
chords        I|V|IV|iv|I|iv|V|I|V|IV|iv|I|iv|V|I|V|IV|iv|I|...
Name: 13, dtype: object

In [14]:
df["artist"].value_counts().head(1)

artist
The Rolling Stones    4
Name: count, dtype: int64

In [21]:
df["artist"].value_counts().head(1000)##If you pass a number larger than the available rows, Pandas just returns all the rows.

artist
The Rolling Stones     4
John Denver            4
Dion                   3
The Everly Brothers    3
Eric Clapton           3
                      ..
Sonny & Cher           1
Bobby Darin            1
Shannon                1
The La's               1
The Chiffons           1
Name: count, Length: 164, dtype: int64

In [20]:
df["artist"].value_counts().head(90)##This code counts how many times each artist appears in the dataset, sorts them from most frequent to least frequent, and displays the top results. Even though only part of the list is shown on the screen, the output actually contains 90 artists.

artist
The Rolling Stones     4
John Denver            4
Dion                   3
The Everly Brothers    3
Eric Clapton           3
                      ..
The Isley Brothers     1
Peggy Lee              1
Janis Joplin           1
Aretha Franklin        1
Ocean                  1
Name: count, Length: 90, dtype: int64

In [26]:
# OPTIONAL!
# This uses some advanced features that we have not discussed. 
# If you are feeling brave, try to figure out why this works
artist_vc = df["artist"].value_counts()
artists_with_one_hit = artist_vc[artist_vc == 4]
artists_with_one_hit.head(15)
##This code counts how many times each artist appears and selects only the artists that appear exactly once in the dataset.

artist
The Rolling Stones    4
John Denver           4
Name: count, dtype: int64

In [27]:
# OPTIONAL!
# This uses some advanced features that we have not discussed. 
# If you are feeling brave, try to figure out why this works
artist_vc = df["artist"].value_counts()
artists_with_one_hit = artist_vc[artist_vc == 5]
artists_with_one_hit.head(15)

Series([], Name: count, dtype: int64)

In [30]:
# OPTIONAL!
# This uses some advanced features that we have not discussed. 
# If you are feeling brave, try to figure out why this works
artist_vc = df["artist"].value_counts()
artists_with_one_hit = artist_vc[artist_vc == 1]
artists_with_one_hit.head(4)

artist
Tanya Tucker         1
Flatt & Scruggs      1
The Power Station    1
The Beatles          1
Name: count, dtype: int64

---
## Part 3: Filtering and Aggregating

We can filter rows based on conditions using boolean indexing:

In [31]:
# Find all songs by a specific artist
stones_songs = df[df["artist"] == "The Rolling Stones"]
print(f"Found {len(stones_songs)} Rolling Stones songs")
stones_songs

Found 4 Rolling Stones songs


Unnamed: 0,id,title,artist,chart_date,chords
21,130,Tumbling Dice,The Rolling Stones,1972-06-24,IV|I|V|I|V|I|IV|V|I|V|I|V|I|IV|V|I|V|I|V|I|IV|...
49,278,Waiting On A Friend,The Rolling Stones,1982-01-09,I|ii|IV|I|ii|IV|I|ii|IV|I|ii|IV|I|ii|IV|I|ii|I...
85,457,Not Fade Away,The Rolling Stones,1964-06-13,I|IV|bVII|IV|I|IV|I|IV|bVII|IV|I|IV|bVII|IV|I|...
96,512,Going To A Go-Go,The Rolling Stones,1982-06-12,I|bVII|IV|I|bVII|I|bVII|IV|I|bVII|I|bVII|IV|I|...


In [35]:
# Use .str to access string methods on text columns
the_songs = df[df["artist"].str.contains("The")]
print(f"Found {len(michael_songs)} songs with 'The' in artist name")
the_songs

Found 1 songs with 'The' in artist name


Unnamed: 0,id,title,artist,chart_date,chords
3,44,Some Like It Hot,The Power Station,1985-04-06,i|VI|VII|i|VI|VII|i|VI|i|VI|i|VI|i|VI|i|VI|i|V...
4,46,I'll Take You There,The Staple Singers,1972-07-01,I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I...
5,50,Love Me Do,The Beatles,1964-07-11,I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I...
6,54,Last Kiss,J. Frank Wilson & The Cavaliers,1964-09-05,I|vi|IV|V|I|vi|IV|V|I|vi|IV|V|I|vi|IV|V|I|vi|I...
7,55,Smoking Gun,The Robert Cray Band,1987-03-07,i|iv|i|iv|i|iv|i|iv|i|iv|i|iv|i|iv|i|iv|i|iv|i...
8,66,Do You Love Me,The Contours,1988-06-18,I|IV|V|vi|V|I|IV|V|I|IV|V|I|IV|V|IV|iv|V|I|IV|...
9,72,Get Together,The Youngbloods,1969-09-06,I|bVII|I|bVII|I|bVII|I|bVII|I|bVII|IV|V|I|IV|V...
21,130,Tumbling Dice,The Rolling Stones,1972-06-24,IV|I|V|I|V|I|IV|V|I|V|I|V|I|IV|V|I|V|I|V|I|IV|...
26,153,Walk Right Back,The Everly Brothers,1961-05-01,I|V|I|IV|I|ii|IV|I|V|I|V|I|IV|I|ii|IV|I|V|I|V|I
38,228,Still Cruisin,The Beach Boys,1989-09-02,I|vi|I|vi|I|vi|I|vi|ii|V|ii|V|I|ii|V|ii|V|I|ii...


### Useful String Methods

| Method | Description |
|--------|-------------|
| `df["col"].str.contains("x")` | Check if contains substring |
| `df["col"].str.startswith("x")` | Check if starts with |
| `df["col"].str.lower()` | Convert to lowercase |
| `df["col"].str.split(",")` | Split by delimiter |

---
## Part 4: Working with the Chords Column

Our chords column contains strings like `"I|IV|V|I"` — chord symbols
separated by `|`. We need to **split** these strings into lists.

In [36]:
chords_before_split = df["chords"].iloc[0]
print("\nBefore splitting:", chords_before_split[0:36], "...\n")

chords_after_split = chords_before_split.split("|")
print("\nAfter splitting:", chords_after_split[:8], "...\n")  # show the beginning


Before splitting: vi|ii|IV|V|I|IV|V|I|IV|V|IV|V|I|IV|V ...


After splitting: ['vi', 'ii', 'IV', 'V', 'I', 'IV', 'V', 'I'] ...



### Your Turn: What Changed?

In a **markdown cell**, explain what the difference is between:

- The value **before** splitting
- The value **after** splitting

Your answer should mention:
- The **data type** (use the `type(chords_before_split)` function if you are unsure)
- What is the advantage of the new format?

Write your answer in the next cell.


### type(chords_before_split
Before splitting, the chords are stored as a single string, where all chord symbols are combined and separated by the "|" character.
After splitting, the data becomes a list, and each chord is stored as its own element.
This new format is useful because it allows us to access and process individual chords, making it easier to analyze the data and prepare it for machine learning.

In [37]:
# Create a new 'chords' column.

chords_split = df["chords"].str.split("|") # create a new column

print('\ndf["chords"].head():')
print(df["chords"].head())


df["chords"].head():
0    vi|ii|IV|V|I|IV|V|I|IV|V|IV|V|I|IV|V|I|vi|IV|V...
1    I|IV|V|IV|I|IV|V|IV|I|IV|V|IV|I|IV|V|IV|I|IV|V...
2    I|vi|I|vi|I|V|I|vi|I|vi|I|V|I|vi|I|vi|I|V|I|vi...
3    i|VI|VII|i|VI|VII|i|VI|i|VI|i|VI|i|VI|i|VI|i|V...
4    I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I|IV|I...
Name: chords, dtype: str


In [38]:
# Correct version

df["chords"] = df["chords"].str.split("|") 

print('\ndf["chords"].head():')
print(df["chords"].head())


df["chords"].head():
0    [vi, ii, IV, V, I, IV, V, I, IV, V, IV, V, I, ...
1    [I, IV, V, IV, I, IV, V, IV, I, IV, V, IV, I, ...
2    [I, vi, I, vi, I, V, I, vi, I, vi, I, V, I, vi...
3    [i, VI, VII, i, VI, VII, i, VI, i, VI, i, VI, ...
4    [I, IV, I, IV, I, IV, I, IV, I, IV, I, IV, I, ...
Name: chords, dtype: object


The output of the previous cell shows `dtype:str`. Interesting!

That means that df["chords"] still contains strings...

Do you undertand why calling `df["chords"].str.split("|")` did not change the values in df["chords"] to strings?
 
Before reading the explanation below, write your best guess:
**Why didn't `df["chords"].str.split("|")` change the values inside `df["chords"]`?**


### Why didn't df["chords"].str.split("|") change the values inside df["chords"]?

The split operation creates a new Series, but it doesn’t modify the original column unless we assign it back. That’s why df["chords"] still contains strings.

### Separating Two Ideas

**Idea 1: Splitting**

- `df["chords"].str.split("|")` computes a *new* Series where each string
  becomes a list. It does not mutate the existing DataFrame.

**Idea 2: Assigning back into the DataFrame**

- `df["chords"] = ...` stores the result into the DataFrame column.
- This ability to use `=` to edit a colum is part of what make pandas useful.

We'll overwrite `df["chords"]` because the rest of the lab expects each row
to contain a list of chord symbols (not one long string).

In [41]:
chords_split = df["chords"].str.split("|") # create a new column
df["chords"] = chords_split                # assign the mutated data

print("After assigning back into df['chords']:")
print(df["chords"].head())
print(f"\nType of first value: {type(df['chords'].iloc[0])}")

After assigning back into df['chords']:
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: chords, dtype: object

Type of first value: <class 'float'>


Now each cell contains a **list** of chord symbols instead of a single string.

The chords are written in **Roman numeral notation**, which describes chords
relative to the key of the song:

- **I** = the "one" chord (tonic major)
- **IV** = the "four" chord (subdominant)
- **V** = the "five" chord (dominant)
- **vi** = the "six" chord (relative minor)

This notation is key-independent.

**What are the advantages and disadvantages of this approach for machine learning?** 

Make a guess and put it in the next cell.


### What are the advantages and disadvantages of this approach for machine learning?
Using Roman numeral notation is useful because it makes chord progressions key-independent. This allows the model to focus on harmonic patterns rather than specific keys, which can improve generalization across different songs.
However, a disadvantage is that removing key information can also reduce musical expressiveness. Without knowing the actual key and pitch content, the model may lose important information about color, tension, and voicing, which can affect chord choice and the sense of harmonic direction in music creation. Different songs in different keys may look identical even though they sound different.

In [50]:
sample = df.iloc[0]
print(f"Title: {sample['title']}")
print(f"Artist: {sample['artist']}")
print(f"Number of chords: {len(sample['chords'])}")
print(f"First 4 chords: {sample['chords'][:8]}")

Title: Here's Some Love
Artist: Tanya Tucker
Number of chords: 159
First 4 chords: vi|ii|IV


### Your Turn: Exploring Chord Progressions

Let's practice accessing the chord data step by step.

In [54]:
# First, select a song using iloc
song = df.iloc[190]
print(f"Song: '{song['title']}' by {song['artist']}")

Song: 'What You Get Is What You See' by Tina Turner


In [55]:
# The "chords" field is now a Python list
chord_list = song["chords"]
print(f"Type: {type(chord_list)}")
print(f"First few chords: {chord_list[:8]}")

Type: <class 'str'>
First few chords: I|IV|V|I


In [56]:
# Remember: len() gives the length of any list
num_chords = len(chord_list)
print(f"This song has {num_chords} chords")

This song has 581 chords


In [57]:
# Use list indexing: [0] for first, [-1] for last
print(f"First chord: {chord_list[0]}")
print(f"Last chord: {chord_list[-1]}")
print(f"Chords 4-8: {chord_list[4:8]}")

First chord: I
Last chord: V
Chords 4-8: |V|I


### Explore on Your Own

Pick a few songs and examine their chord progressions:
- Which song has the most chords?
- Do most songs start on the same chord?
- Do songs tend to end on the "I" chord?

In [58]:
# Try looking at different songs:
df.iloc[50]["chords"][:10]

'I|IV|V|I|I'

In [61]:
# Try looking at different songs:
df.iloc[11]["chords"][:40]

'I|IV|V|IV|I|IV|V|IV|I|IV|V|IV|V|IV|V|I|I'

In [59]:
print(len(df.iloc[0]["chords"]))
print(len(df.iloc[50]["chords"]))
print(len(df.iloc[100]["chords"]))

159
50
203


In [59]:
print(len(df.iloc[0]["chords"]))
print(len(df.iloc[50]["chords"]))
print(len(df.iloc[100]["chords"]))

159
50
203


In [64]:
df["num_chords"] = df["chords"].apply(len)
df["num_chords"].max()
df.loc[df["num_chords"].idxmax()]

id                                                          451
title                                               Got It Made
artist                                     Crosby, Stills, Nash
chart_date                                           1989-03-11
chords        IV|vi|V|IV|vi|V|IV|vi|V|IV|vi|V|I|IV|I|V|I|V|I...
num_chords                                                  639
Name: 81, dtype: object

In [65]:
###.apply(len) → applies the len function to each list and computes its length
###.max() → returns the maximum value
###.idxmax() → returns the index (row number) of the maximum value

In [67]:
start_chords = df["chords"].str[0]
end_chords = df["chords"].str[-1]
start_chords.value_counts().head()
end_chords.value_counts().head()

chords
I    129
V     56
i     30
v      1
Name: count, dtype: int64

---
## Part 5: Computing New Columns

We can create new columns by applying functions to existing ones.
Let's add a column for the number of chords in each song.

In [68]:
# .apply(len) runs len() on each row's chord list
df["num_chords"] = df["chords"].apply(len)
df[["title", "artist", "num_chords"]].head(10)

Unnamed: 0,title,artist,num_chords
0,Here's Some Love,Tanya Tucker,159
1,The Joker,Steve Miller Band,353
2,Foggy Mountain Breakdown,Flatt & Scruggs,183
3,Some Like It Hot,The Power Station,175
4,I'll Take You There,The Staple Singers,194
5,Love Me Do,The Beatles,148
6,Last Kiss,J. Frank Wilson & The Cavaliers,179
7,Smoking Gun,The Robert Cray Band,51
8,Do You Love Me,The Contours,271
9,Get Together,The Youngbloods,203


In [75]:
# Pandas makes it easy to compute summary statistics
print(f"Average chords per song: {df['num_chords'].mean():.1f}")
print(f"Minimum: {df['num_chords'].min()}")
print(f"Maximum: {df['num_chords'].max()}")
print(f"Median: {df['num_chords'].median()}")

Average chords per song: 186.3
Minimum: 1
Maximum: 639
Median: 162.0


In [76]:
# These return the INDEX of the max/min value
longest_song = df.loc[df["num_chords"].idxmax()]
print(f"Longest: '{longest_song['title']}' by {longest_song['artist']} ({longest_song['num_chords']} chords)")

shortest_song = df.loc[df["num_chords"].idxmin()]
print(f"Shortest: '{shortest_song['title']}' by {shortest_song['artist']} ({shortest_song['num_chords']} chords)")

Longest: 'Got It Made' by Crosby, Stills, Nash (639 chords)
Shortest: 'La Grange' by ZZ Top (1 chords)


In [77]:
# Which songs have more than 200 chords?
long_songs = df[df["num_chords"] > 200]
print(f"Songs with 200+ chords: {len(long_songs)}")
long_songs[["title", "artist", "num_chords"]]

Songs with 200+ chords: 73


Unnamed: 0,title,artist,num_chords
1,The Joker,Steve Miller Band,353
8,Do You Love Me,The Contours,271
9,Get Together,The Youngbloods,203
10,You Can Call Me Al,Paul Simon,500
11,Red Red Wine,UB40,421
...,...,...,...
205,Lucky Man,"Emerson, Lake & Palmer",207
207,Island Of Lost Souls,Blondie,421
208,Baby Don't Go,Sonny & Cher,336
212,Let The Music Play,Shannon,440


---
## Part 6: Building the Vocabulary

Before we can encode chords for a neural network, we need to know what
chords exist in our dataset. This set of all possible values is called
the **vocabulary**.

In [78]:
all_chords = []
for chord_list in df["chords"]:
    if chord_list:  # Skip empty lists
        all_chords.extend(chord_list)

print(f"Total chord occurrences: {len(all_chords)}")

Total chord occurrences: 40235


In [79]:
unique_chords = sorted(set(all_chords))
print(f"Found {len(unique_chords)} unique chords:")
print(unique_chords)

Found 6 unique chords:
['I', 'V', 'b', 'i', 'v', '|']


of categories.

Notice the mix of major (uppercase) and minor (lowercase):
- Major: I, IV, V, VI, VII, bVII
- Minor: i, ii, iv, vi


To train our first neural network, we are going to use a small vocabulary.

The original dataset includes chords with inversions and much more complicated
variations. You make want to use these more complex data in the futue, but for
now we have simplified 7th chords and ommited many songs from the dataset that
have more complicated chords.
**That is why our dataset includes only 216 songs, then the original contains over 700.**

---
## Part 7: Chord Distribution

How often does each chord appear? This tells us about common patterns in
popular music.

In [80]:
chord_counts = Counter(all_chords)

print("Chord distribution (most common first):\n")
for chord, count in chord_counts.most_common():
    bar = "█" * (count // 500)  # Simple text bar chart
    print(f"  {chord:5s}: {count:5d} {bar}")

Chord distribution (most common first):

  |    : 15651 ███████████████████████████████
  I    : 11391 ██████████████████████
  V    :  8621 █████████████████
  i    :  3218 ██████
  v    :   902 █
  b    :   452 


**What does this tell us?**

- **I** dominates — most songs emphasize the tonic chord
- **IV** and **V** are the next most common (the classic I-IV-V progression)
- Minor chords (i, vi, ii, iv) appear less frequently
- **bVII** and **VII** are relatively rare

This imbalance will affect our neural network — it will learn more about
common chords than rare ones.

In 2–4 sentences:
- What chord(s) seem most common?
- How might this imbalance affect a model trained to predict the next chord?


### What chord(s) seem most common?
### How might this imbalance affect a model trained to predict the next chord?
The most common chords are I, V, and IV — basically the I-IV-V progression, which you hear literally everywhere in pop music.
Since the data has way more I, IV, and V than anything else, the model will probably just keep predicting those same chords over and over. It won't really "know" the rarer chords like bVII because it barely saw them during training, so the predictions might end up sounding kind of boring and repetitive.

In [81]:
# What fraction of all chords is just "I"?
i_percentage = chord_counts["I"] / len(all_chords) * 100
print(f"The 'I' chord makes up {i_percentage:.1f}% of all chord occurrences!")

The 'I' chord makes up 28.3% of all chord occurrences!


---
## Part 8: Why One-Hot Encoding?

Now we need to convert these chord symbols into numbers for our neural network.

**The naive approach**: assign each chord an integer.

In [82]:
naive_encoding = {chord: i for i, chord in enumerate(unique_chords)}
print("Naive integer encoding:")
for chord, idx in naive_encoding.items():
    print(f"  {chord} → {idx}")

Naive integer encoding:
  I → 0
  V → 1
  b → 2
  i → 3
  v → 4
  | → 5


**Why is this problematic?**

If we feed these integers directly into a linear layer, the model might learn
that:
- "I" (0) and "IV" (1) are "close" because 0 and 1 are close numbers
- "I" (0) and "vi" (9) are "far apart" because 0 and 9 are distant

But musically, this makes no sense! The relationship between chords isn't
captured by arbitrary index numbers.

**One-hot encoding** solves this by treating all categories as **equidistant**.
Each chord becomes a vector of zeros with a single 1 at its position.

### Think About It

In the naive encoding, "IV" is encoded as 1 and "V" as 2. A neural network
might learn these are "close" because 1 and 2 are close numbers.

Is this good or bad? Actually, IV and V *are* musically related — they're
both major chords commonly used together. So sometimes the naive encoding
creates useful relationships **by accident**.

The problem is that it also creates *meaningless* relationships. Is "I" (0)
really maximally different from "vi" (9)? Not musically!

**One-hot encoding** gives us a clean slate. All chords start equidistant,
and the network learns *actual* relationships from the data.

In your own words:
- Why can integer (naive) encoding be misleading for chord categories?
- What does one-hot encoding fix?


### One-Hot Encoding

Why can integer encoding be misleading?
When you just assign numbers like I=0, IV=1, V=2, the model thinks chords with close numbers are actually similar. But that's totally random — I only got 0 because it happened to come first in the list, not because it's "close" to IV musically. So the model ends up learning fake relationships that don't mean anything.
What does one-hot encoding fix?
One-hot encoding makes every chord equally "far" from every other chord — no chord is closer or further just because of its number. So the model has to actually learn the real relationships from the data, instead of getting confused by meaningless numbers.

---
## Part 9: Build Encoder/Decoder Functions

Let's build functions to convert between chord symbols and one-hot vectors.

In [102]:
CHORDS = sorted(set(all_chords))  # Our vocabulary
NUM_CHORDS = len(CHORDS)

# Use a python "dictionary comprehension" to construct a mapping dict

stoi = {chord: i for i, chord in enumerate(CHORDS)}

# `stoi` is a python dictionary that maps string=>index

print(f"Vocabulary size: {NUM_CHORDS}\n")
stoi

Vocabulary size: 10



{'I': 0,
 'IV': 1,
 'V': 2,
 'VI': 3,
 'VII': 4,
 'bVII': 5,
 'i': 6,
 'ii': 7,
 'iv': 8,
 'vi': 9}

In [103]:
print(all_chords[:20])
print(type(all_chords[0]))

['vi', 'ii', 'IV', 'V', 'I', 'IV', 'V', 'I', 'IV', 'V', 'IV', 'V', 'I', 'IV', 'V', 'I', 'vi', 'IV', 'V', 'IV']
<class 'str'>


In [104]:
all_chords = [chord for song in df['chords'] for chord in song.split('|')]
unique_chords = sorted(set(all_chords))
stoi = {chord: i for i, chord in enumerate(unique_chords)}
NUM_CHORDS = len(unique_chords)
print(stoi)
print(f"NUM_CHORDS: {NUM_CHORDS}")

{'I': 0, 'IV': 1, 'V': 2, 'VI': 3, 'VII': 4, 'bVII': 5, 'i': 6, 'ii': 7, 'iv': 8, 'vi': 9}
NUM_CHORDS: 10


In [105]:
def encode(chord):
    """One-hot encode a chord symbol."""
    index = stoi[chord]
    vector = torch.zeros(NUM_CHORDS)
    vector[index] = 1.0
    return vector


# Test it
print('encode("I") =', encode("I"))
print('encode("V") =', encode("V"))
print('encode("vi") =', encode("vi"))

encode("I") = tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
encode("V") = tensor([0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])
encode("vi") = tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 1.])


In [106]:
print(stoi) 

{'I': 0, 'IV': 1, 'V': 2, 'VI': 3, 'VII': 4, 'bVII': 5, 'i': 6, 'ii': 7, 'iv': 8, 'vi': 9}


In [107]:
def decode(vector):
    """Decode a one-hot vector back to a chord symbol."""
    index = torch.argmax(vector).item()
    return CHORDS[index]


# Test round-trip
original = "IV"
encoded = encode(original)
decoded = decode(encoded)
print(f"Original: {original}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")

Original: IV
Encoded: tensor([0., 1., 0., 0., 0., 0., 0., 0., 0., 0.])
Decoded: IV


### Your Turn

Predict what `encode("ii")` will produce:

In [110]:
my_guess = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # FILL THIS IN - a list of 10 values (0s and 1s)

check(my_guess, encode("ii").tolist())

✓ Correct!


Now predict what `decode(torch.tensor([0, 0, 0, 0, 0, 0, 0, 1, 0, 0]))` returns:

In [111]:
my_guess = "ii"  # FILL THIS IN - a chord symbol like "I" or "vi"

check(my_guess, decode(torch.tensor([0, 0, 0, 0, 0, 0, 0, 1, 0, 0])))

✓ Correct!


---
## Part 10: Encode a Sequence

Neural networks often need to process **sequences** of chords, not just
single chords. Let's encode a chord progression.

In [112]:
progression = ["I", "IV", "V", "I"]

# Encode each chord
encoded_chords = [encode(c) for c in progression]

# Look at the individual encodings
for chord, vec in zip(progression, encoded_chords):
    print(f"{chord}: {vec}")

I: tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
IV: tensor([0., 1., 0., 0., 0., 0., 0., 0., 0., 0.])
V: tensor([0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])
I: tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])


In [113]:
stacked = torch.stack(encoded_chords)
print(f"\nStacked shape: {stacked.shape}")
print(stacked)


Stacked shape: torch.Size([4, 10])
tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])


This is a `(4, 10)` tensor — 4 chords, each with 10 values.

For a multi-layer perceptron (MLP), we usually want a **flat** 1D input.
We can flatten the matrix:

In [114]:
flattened = stacked.flatten()
print(f"Flattened shape: {flattened.shape}")
print(flattened)

Flattened shape: torch.Size([40])
tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.])


Now our 4-chord progression is a single vector of **40 values** (4 × 10).

This is exactly what we'll feed into our neural network!

Explain (briefly) what `.flatten()` did here.

Your answer should mention:
- How the **shape** changed
- Why we might prefer a flat 1D vector for an MLP


### .flatten()

.flatten() took the 4×10 matrix and turned it into a single vector of 40 numbers. Basically the shape went from (4, 10) → (40,).
We need it flat because an MLP just takes one straight line of numbers as input — it can't handle a 2D grid. So flattening is like unfolding the table into one long row before feeding it into the network.

In [115]:
def encode_sequence(chords):
    """Encode a list of chords as a flat tensor."""
    encoded = [encode(c) for c in chords]
    return torch.stack(encoded).flatten()


# Test it
test_seq = ["I", "V", "vi", "IV"]
encoded = encode_sequence(test_seq)
print(f"Sequence: {test_seq}")
print(f"Encoded shape: {encoded.shape}")

Sequence: ['I', 'V', 'vi', 'IV']
Encoded shape: torch.Size([40])


### Your Turn

If we encode a sequence of **5 chords**, what will the flattened tensor's
shape be?

In [116]:
my_guess = 50  # FILL THIS IN - a single number

five_chord_seq = ["I", "IV", "V", "IV", "I"]
actual_shape = encode_sequence(five_chord_seq).shape[0]
check(my_guess, actual_shape)

✓ Correct!


---
## Part 11: Decoding Model Output

When a neural network makes a prediction, it doesn't output a perfect one-hot
vector. Instead, it outputs **raw scores** (called "logits") for each category.

These can be any numbers — positive, negative, large, small.

In [117]:
# Pretend this came from a neural network
logits = torch.tensor([0.5, -0.2, 0.1, 0.8, -0.5, 2.1, 0.3, -0.1, 1.5, 0.4])
print(f"Logits: {logits}")
print(f"Shape: {logits.shape}")

Logits: tensor([ 0.5000, -0.2000,  0.1000,  0.8000, -0.5000,  2.1000,  0.3000, -0.1000,
         1.5000,  0.4000])
Shape: torch.Size([10])


To get **probabilities**, we apply the **softmax** function.
It converts raw scores into values between 0 and 1 that sum to 1.

In [118]:
probabilities = torch.softmax(logits, dim=0)
print(f"Probabilities: {probabilities}")
print(f"Sum: {probabilities.sum():.4f}")

Probabilities: tensor([0.0723, 0.0359, 0.0485, 0.0976, 0.0266, 0.3582, 0.0592, 0.0397, 0.1966,
        0.0654])
Sum: 1.0000


To get the model's prediction, we find the **highest probability**:

In [119]:
predicted_index = torch.argmax(probabilities).item()
predicted_chord = CHORDS[predicted_index]
confidence = probabilities[predicted_index].item()

print(f"Predicted index: {predicted_index}")
print(f"Predicted chord: {predicted_chord}")
print(f"Confidence: {confidence:.1%}")

Predicted index: 5
Predicted chord: bVII
Confidence: 35.8%


The complete pipeline:

```
Input chords → encode → MODEL → logits → softmax → argmax → decode → Output chord
```

In your own words, what are:
- **logits**
- **probabilities**

And why do we often apply **softmax** before choosing the top prediction?


### Decoding Model Output¶
Logits = your feeling
"I kinda want cake, not really feeling ice cream, pretty into pudding."
Just vague feelings, no actual numbers.
Softmax = turning that feeling into percentages
"36% pudding, 20% cake, 7% ice cream..."
All adds up to 100%.

Because without softmax, you just have raw scores that are hard to compare — like is 2.1 really that much better than 1.5? But after softmax, they become percentages, so you can clearly see "36% vs 20%" and confidently pick the highest one.

### Your Turn

Given these logits, what chord will the model predict?

In [120]:
test_logits = torch.tensor([0, 0, 0, 5, 0, 0, 0, 0, 0, 0])

my_guess = "VI"  # FILL THIS IN - a chord symbol

# Compute the answer
probs = torch.softmax(test_logits, dim=0, dtype=torch.float)
predicted = CHORDS[torch.argmax(probs).item()]
check(my_guess, predicted)

✓ Correct!


---
## Summary

**Pandas Skills:**
- Load data: `pd.read_csv()`
- Explore: `.shape`, `.columns`, `.head()`, `.tail()`, `.iloc[]`
- Filter: `df[df["col"] == "value"]`, `.str.contains()`
- Transform: `.str.split()`, `.apply()`
- Aggregate: `.value_counts()`, `.mean()`, `.max()`, `.min()`

**Encoding Skills:**
- **Vocabulary**: The set of all possible values (10 chords in our dataset)
- **stoi**: Dictionary mapping symbols to indices
- **One-hot encoding**: Vector of zeros with a single 1 at the item's index
- **Why one-hot**: Treats all categories as equidistant (no false ordering)
- **Sequences**: Stack and flatten multiple one-hot vectors for MLP input
- **Decoding**: Use `argmax` to go from model output back to a symbol

These techniques apply to any categorical data in machine learning!

---
## What's Next?

In the next notebook (`next_chord_prediction.py`), we'll use these encoding
skills to train a neural network that predicts the **next chord** in a
progression.

Given chords like `["I", "IV", "V", "I"]`, the model will learn to predict
what chord comes next!