# Notebook 1: Establishing the Data

This notebook walks the reader through four things: acquiring and reading the data, explaining how to obtain it and where it comes from, describing its structure, and noting its known errors and limitations.

___
## Acquire and Read in the Data Source(s)

First, we load the datasets retrieved from MGPHot: A Dataset of Musicological Annotations for Popular Music (1958–2022) and The Billboard Melodic Music Dataset (BiMMuDa) using the pandas library. The files are provided as TSV and CSV, respectively, and are stored locally or in a repository.

In [30]:
#loading MGPHot
import ast

import pandas as pd

GENE_VALS = pd.read_csv('mgphot_gene_values.tsv',
                        sep='\t',               #specify tab as delimiter for TSV files
                        on_bad_lines='skip')    #skip malformed rows instead of raising an error

GENE_NAMES = pd.read_csv('mgphot_genes.tsv',
                         sep='\t',
                         on_bad_lines='skip')

HOT100 = pd.read_csv('hot100_charts.tsv',
                     sep='\t',
                     on_bad_lines='skip')

#gene_values is read as a string; ast.literal_eval turns it into a real list
#(before this step the value behaved like a list of single characters: ['[','0','.','4',',',' '])
GENE_VALS['gene_values'] = GENE_VALS['gene_values'].apply(lambda x: ast.literal_eval("".join(x)))
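
The `ast.literal_eval` step in the cell above can be illustrated on a small in-memory table. This is a minimal sketch, assuming `gene_values` is stored in the TSV as the text of a Python list; the two sample rows are made up for illustration and are not taken from MGPHot.

```python
import ast
import io

import pandas as pd

# Two made-up rows standing in for mgphot_gene_values.tsv (illustrative only).
sample_tsv = (
    "mgphot_track_id\tgene_values\n"
    "1\t[0.4, 0.0, 1.0]\n"
    "2\t[0.2, 0.6, 0.8]\n"
)

sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# read_csv leaves the column as plain text such as "[0.4, 0.0, 1.0]".
raw_first = sample.loc[0, "gene_values"]

# ast.literal_eval safely parses each string into a real list of floats.
sample["gene_values"] = sample["gene_values"].apply(ast.literal_eval)

first_values = sample.loc[0, "gene_values"]
print(type(raw_first).__name__, "->", type(first_values).__name__)
```

`ast.literal_eval` only accepts Python literals, so it will not execute arbitrary code embedded in a cell the way `eval` would.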

In [31]:
#loading BiMMuDa
#bimmuda_genre.csv is the CSV with genre information, downloaded directly from the GitHub repository.
#It is initially called bimmuda_per_song_full.csv.
TOP_5 = pd.read_csv('bimmuda_genre.csv')
TOP_MELODY = pd.read_csv('bimmuda_per_melody_full.csv')
TOP_SONG = pd.read_csv('bimmuda_per_song_full.csv')

___
## Describing How to Get the Data

To obtain the original datasets, visit the GitHub repositories linked below. The files are freely accessible and can be downloaded without any form of authentication.

The links are:

MGPHot (Top 100): https://github.com/utdata/rwd-billboard-data

BiMMuDa (Top 5): https://github.com/madelinehamilton/BiMMuDa/

For MGPHot, GitHub Actions automatically scrape and combine the charts each week. No API key is needed; the files are public and can be downloaded directly.

To replicate the acquisition process, users can either:

- Clone or download the GitHub repository, or

- Use the same R scripts (action_scrape_charts.R, action_combine_charts.R) included in the .github/workflows folder.
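
For readers who prefer not to clone the repository, a single file can also be read from GitHub's raw-content host. The following is a minimal sketch: the helper `raw_github_url` is hypothetical, the branch name `main` is an assumption that should be checked against the repository, and `data-out/hot-100-current.csv` is the Hot 100 file path given in the repository.

```python
def raw_github_url(owner: str, repo: str, branch: str, path: str) -> str:
    """Build the raw-content URL for a file in a public GitHub repository."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"


# Assumption: the default branch is "main"; adjust if the repository differs.
hot100_url = raw_github_url(
    owner="utdata",
    repo="rwd-billboard-data",
    branch="main",
    path="data-out/hot-100-current.csv",
)
print(hot100_url)

# Public files need no authentication, so pandas can read the URL directly.
# Left commented out so the sketch runs without network access:
# import pandas as pd
# hot100_current = pd.read_csv(hot100_url)
```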

___
## Data Provenance/Origin

#### MGPHot

The MGPHot data comes from the rwd-billboard-data GitHub repository, which maintains an ongoing archive of both the Billboard Hot 100 and Billboard 200 charts. Anyone can obtain the data by visiting the project’s repository and downloading either:

- data-out/hot-100-current.csv (Hot 100 since 1958)

- data-out/billboard-200-current.csv (Billboard 200 since 1967)

#### Who Produced the Data and How?

This dataset is produced and maintained by contributors to the rwd-billboard-data GitHub project. It aggregates Billboard chart information through a combination of:

- GitHub Actions scraping:
  Automated jobs scrape the newest Hot 100 and Billboard 200 charts online each week using R scripts.

- Kaggle dataset:
  An older archive (hot100_kaggle_195808_20211106.csv) covering 1958–2021.

- Google Sheet scraping:
  The Chrome extension “Data Miner” was used to fill in gaps left by the Kaggle dataset.

- data.world Hot 100 archive:
  Another partially-complete archive used to fill missing records through June 2021.

#### How Is It Produced?

Two automated GitHub workflows run Tuesday–Friday:

- scrap_charts.yml calls action_scrape_charts.R to download the current weekly chart data directly from Billboard (the repository gives no further detail).

- combine_charts.yml calls action_combine_charts.R to merge new charts with existing archives.

Additional RMarkdown notebooks (01-scrape-charts.Rmd, 02-combine-charts.Rmd, 03-check-charts.Rmd) support maintenance, cleaning, and validation.
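
The combine step can be pictured in pandas, the library used elsewhere in this notebook. This is only an illustrative Python analogue, not the repository's R code: the function name `combine_charts`, the sample rows, and the choice of `chart_week` plus `current_week_position` as the de-duplication key are all assumptions.

```python
import pandas as pd


def combine_charts(archive: pd.DataFrame, new_week: pd.DataFrame) -> pd.DataFrame:
    """Append newly scraped rows to the archive, dropping repeated chart slots."""
    combined = pd.concat([archive, new_week], ignore_index=True)
    # A chart slot is identified here by its week and position (an assumption).
    combined = combined.drop_duplicates(
        subset=["chart_week", "current_week_position"], keep="last"
    )
    return combined.sort_values(
        ["chart_week", "current_week_position"]
    ).reset_index(drop=True)


# Made-up rows for illustration; they are not real chart entries.
archive = pd.DataFrame({
    "chart_week": ["2024-01-06", "2024-01-06"],
    "current_week_position": [1, 2],
    "title": ["Song A", "Song B"],
})
new_week = pd.DataFrame({
    "chart_week": ["2024-01-06", "2024-01-13"],
    "current_week_position": [2, 1],
    "title": ["Song B", "Song C"],
})

combined = combine_charts(archive, new_week)
print(combined)
```

Keeping the last copy of a repeated slot means a re-scraped week overwrites the archived version, which is one reasonable policy among several.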

___
#### BiMMuDa
The BiMMuDa dataset comes from the BiMMuDa GitHub repository maintained by madelinehamilton and is a static representation of the top five Billboard year-end singles for each year from 1950 to 2022.

#### Who Produced the Data and How?
The BiMMuDa dataset was compiled by a research team studying Western popular music. For each year (1950–2022), they transcribed the main melodies of the top five Billboard year-end singles. When available, high-quality multitrack MIDI files were used; otherwise, trained musicians transcribed the melodies manually.


#### How Is It Produced?
Each melody was checked, cleaned, segmented into song sections (verse, chorus, bridge, etc.), and paired with metadata including key, tempo, structure, and tonic. The dataset was fully manually reviewed and quality-checked through comparison with the CoCoPops melodic corpus. Agreement between the transcriptions in the two datasets establishes the quality of the transcriptions in BiMMuDa.


___
## Data Features

In [32]:
#MGPHot Gene Values
GENE_VALS_COLS = pd.DataFrame({
    "Column Name": GENE_VALS.columns,
    "Non-Null Count": GENE_VALS.notnull().sum().values,
    "Dtype": GENE_VALS.dtypes.values,
})

GENE_VALS_COLS

| | Column Name | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | year | 21299 | int64 |
| 1 | mgphot_track_id | 21299 | int64 |
| 2 | artist | 21299 | object |
| 3 | title | 21299 | object |
| 4 | gene_values | 21299 | object |


In [33]:
#MGPHot Gene Types
GENE_COLS = pd.DataFrame({
    "Column Name": GENE_NAMES.columns,
    "Non-Null Count": GENE_NAMES.notnull().sum().values,
    "Dtype": GENE_NAMES.dtypes.values,
})

GENE_COLS

| | Column Name | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | gene_id | 58 | int64 |
| 1 | name | 58 | object |
| 2 | category | 58 | object |
| 3 | description | 58 | object |


In [34]:
#MGPHot Hot 100
HOT100_COLS = pd.DataFrame({
    "Column Name": HOT100.columns,
    "Non-Null Count": HOT100.notnull().sum().values,
    "Dtype": HOT100.dtypes.values,
})

HOT100_COLS

| | Column Name | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | chart_week | 334500 | object |
| 1 | current_week_position | 334500 | int64 |
| 2 | artist | 334500 | object |
| 3 | title | 334500 | object |
| 4 | mgphot_track_id | 259765 | float64 |
| 5 | last_week_position | 302040 | float64 |
| 6 | peak_position | 334500 | int64 |
| 7 | weeks_on_chart | 334500 | int64 |
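
`mgphot_track_id` is `float64` in `HOT100` but `int64` in `GENE_VALS` because pandas, by default, stores an integer column that contains missing values as floats. The counts in the table imply 334500 - 259765 = 74735 chart rows with no track ID. A minimal sketch of counting such rows and converting the column to pandas' nullable `Int64` dtype follows; the three sample rows are made up.

```python
import io

import pandas as pd

# Made-up rows; the middle one has no track ID, as many HOT100 rows do.
sample_tsv = (
    "title\tmgphot_track_id\n"
    "Song A\t101\n"
    "Song B\t\n"
    "Song C\t103\n"
)
chart = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# The missing value forces the integer IDs into a float dtype.
dtype_before = str(chart["mgphot_track_id"].dtype)

# Count rows that cannot be linked to a track.
missing_ids = int(chart["mgphot_track_id"].isna().sum())

# Int64 (capital I) is pandas' nullable integer dtype: IDs stay integers
# and missing entries become <NA> instead of NaN.
chart["mgphot_track_id"] = chart["mgphot_track_id"].astype("Int64")
print(dtype_before, "->", chart["mgphot_track_id"].dtype)
print("rows without a track ID:", missing_ids)
```

Converting before any join on `mgphot_track_id` keeps the key as an integer in both tables.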


In [35]:
#BiMMuDa with genre
TOP5_GENRE_COLS = pd.DataFrame({
    "Column Name": TOP_5.columns,
    "Non-Null Count": TOP_5.notnull().sum().values,
    "Dtype": TOP_5.dtypes.values,
})

TOP5_GENRE_COLS

| | Column Name | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | Title | 381 | object |
| 1 | Artist | 381 | object |
| 2 | Year | 381 | int64 |
| 3 | Position | 381 | object |
| 4 | Genre (Broad 1) | 381 | object |
| 5 | Genre (Broad 2) | 147 | object |
| 6 | Genre (Specific 1) | 241 | object |
| 7 | Genre (Specific 2) | 80 | object |
| 8 | Genre (Specific 3) | 26 | object |
| 9 | Link to Audio | 381 | object |


In [36]:
#BiMMuDa per melody
TOPMEL_COLS = pd.DataFrame({
    "Column Name": TOP_MELODY.columns,
    "Non-Null Count": TOP_MELODY.notnull().sum().values,
    "Dtype": TOP_MELODY.dtypes.values,
})

TOPMEL_COLS

| | Column Name | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | ID | 1133 | object |
| 1 | Year | 1133 | int64 |
| 2 | Position | 1133 | object |
| 3 | Label | 1042 | object |
| 4 | BPM | 1133 | int64 |
| 5 | Mode | 1133 | object |
| 6 | Tonic | 1133 | object |
| 7 | Length | 1133 | float64 |
| 8 | Number of Note Events | 1133 | int64 |
| 9 | Tonality | 1133 | float64 |

In [37]:
#BiMMuDa per song
TOPSONG_COLS = pd.DataFrame({
    "Column Name": TOP_SONG.columns,
    "Non-Null Count": TOP_SONG.notnull().sum().values,
    "Dtype": TOP_SONG.dtypes.values,
})

TOPSONG_COLS

| | Column Name | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | Title | 379 | object |
| 1 | Artist | 379 | object |
| 2 | Year | 379 | int64 |
| 3 | Position | 379 | object |
| 4 | Link to Audio | 379 | object |
| 5 | Tonic 1 | 379 | object |
| 6 | Tonic 2 | 10 | object |
| 7 | Tonic 3 | 1 | object |
| 8 | Mode 1 | 379 | object |
| 9 | Mode 2 | 10 | object |
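
The six cells in this section repeat the same three-column summary. One possible way to factor that pattern into a helper is sketched below; `summarize_columns` is a hypothetical name, and the tiny frame is made up so the sketch runs without the dataset files.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its non-null count and dtype."""
    return pd.DataFrame({
        "Column Name": df.columns,
        "Non-Null Count": df.notnull().sum().values,
        "Dtype": df.dtypes.values,
    })


# Made-up frame with one missing value, for illustration only.
demo = pd.DataFrame({"Year": [1950, 1951], "Label": ["verse", None]})
summary = summarize_columns(demo)
print(summary)
```

With the notebook's frames loaded, `summarize_columns(GENE_VALS)` would build the same summary as the first cell in this section.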


___
## Errors & Limitations

#### Billboard Chart Limitations
It must be noted that Billboard’s charts are a limited measure of popularity: the charts used for this dataset track year-end popularity only in the United States, which may not capture global or even broader Western popularity trends. Additionally, artists and labels have historically manipulated chart performance (Andrews, 2018).

Billboard’s internal policies have also prevented major hits from charting accurately in the following ways:

- Pre-1991: Year-end rankings could “split” the success of late-year releases across two years, making them appear less popular than they were.

- 1990s eligibility rules: Some massively successful songs were ineligible for year-end charts if they were not released as purchasable singles.
  - Example: “Iris” by the Goo Goo Dolls was #1 on Hot 100 Airplay for 18 weeks in 1998 but did not appear on the year-end singles chart.

Despite these issues, Billboard remains the most widely recognized measure of U.S. musical popularity, and the top year-end singles continue to serve as strong indicators of mainstream U.S. music trends.

___
#### Known MGPHot Hot 100 Data Errors 

1970-01-10 & 1970-01-17: “Rainy Night In Georgia/Rubberneckin’” by Brook Benton appears in a way that many believe to be a data error, especially given that Elvis’s “Rubberneckin’” charted higher in the same week.

1961 duplicate entries: “Every Beat of My Heart” is duplicated in several weeks. One entry is credited to The Pips, another to Gladys Knight & The Pips. This duplication appears in both the data.world archives and Billboard’s current website.

Gaps in the Kaggle archive were partially filled manually (via the Chrome extension Data Miner) and supplemented with data.world, which may introduce slight inconsistencies.
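
Duplicate entries such as the “Every Beat of My Heart” case can be flagged by looking for a title that appears under more than one artist credit in the same chart week. The sketch below assumes the `chart_week`, `title`, and `artist` columns of `HOT100`; the chart date and the second song are made up for illustration.

```python
import pandas as pd

# Illustrative rows; the week and "Other Song" are invented.
chart = pd.DataFrame({
    "chart_week": ["1961-06-05", "1961-06-05", "1961-06-05"],
    "title": ["Every Beat Of My Heart", "Every Beat Of My Heart", "Other Song"],
    "artist": ["The Pips", "Gladys Knight & The Pips", "Other Artist"],
})

# Count distinct artist credits for each title within each week.
credits = chart.groupby(["chart_week", "title"])["artist"].nunique()

# Keep only week/title pairs credited to more than one artist.
suspect_pairs = credits[credits > 1].reset_index(name="artist_credits")
print(suspect_pairs)
```

A flagged pair is only a candidate: two different songs can share a title in the same week, so each hit still needs manual review.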

#### Structural / Methodological Limitations

- Pre-2022 data depends on assembled archives with occasional missing weeks or misformatted entries.

- Week-to-week comparisons may contain minor inconsistencies from merged sources.

- Genre, artist, or format labeling sometimes varies depending on source.
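
Missing weeks can be checked by comparing the chart dates present in the data against a full weekly calendar. This is a minimal sketch, assuming `chart_week` holds one date per weekly chart and that charts are spaced seven days apart; the dates below are made up.

```python
import pandas as pd

# Made-up chart dates with one week (2021-01-16) deliberately left out.
chart_weeks = pd.to_datetime(
    pd.Series(["2021-01-02", "2021-01-09", "2021-01-23", "2021-01-30"])
)

# Every date expected between the first and last chart, seven days apart.
expected = pd.date_range(chart_weeks.min(), chart_weeks.max(), freq="7D")

# Dates in the calendar that never appear in the data.
missing_weeks = expected.difference(pd.DatetimeIndex(chart_weeks.unique()))
print(missing_weeks)
```

If the real chart dates do not all fall on the same weekday, the seven-day calendar would need to be adjusted before the comparison is meaningful.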

___
#### BiMMuDa Melody Dataset
##### Singles With No Main Melody

Some songs rely on spoken delivery or rhythmic chant rather than a singable melodic line. For these, the dataset includes lyrics only, with no MIDI or MuseScore melody file. Conversely, some hits are driven by instrumental lead melodies, so the dataset includes MIDI only, with no lyrics files. The names of these songs are listed in the BiMMuDa GitHub repository.

#### General Limitations of BiMMuDa

- Melodies are transcribed monophonically, losing expressive nuance such as vibrato, microtiming, ad-libs, and vocal inflections.

- Songs with complex textures or multiple leads require subjective decisions about which line counts as the “main melody.”

- Some section boundaries (verse, chorus, bridge) require interpretation when forms are ambiguous.

- The dataset includes only top five year-end Billboard singles per year, which reflects U.S. popular exposure but not the diversity of global pop.