This repository builds a series of classifiers that predict whether melodies from the Essen corpus originate from China or Europe, using numeric features from the Python package melody-features as predictors. It reproduces the manuscript’s confusion-matrix figures, runs exploratory factor analysis and a factor-based logistic model in R, and benchmarks logistic regression for each feature-extraction source (IDyOM, jSymbolic, etc.).
Run everything from the repo root unless noted otherwise.

| Requirement | Notes |
|---|---|
| Python 3.10+ | Check with `python3 --version`. |
| R | Check with `Rscript --version`. Install from CRAN. |
| `melody-features` | Installed via `requirements.txt`. It ships the Essen corpus used to resolve melody paths from basename lists. |

1. Clone and enter the repo

       git clone https://github.com/dmwhyatt/Style-Classification-Analysis.git
       cd Style-Classification-Analysis

2. Python environment

       python3 -m venv .venv
       source .venv/bin/activate  # Windows: .venv\Scripts\activate
       pip install -r requirements.txt

3. R packages (for `factor_logistic.R` only)

       Rscript -e 'install.packages(c("tidyverse", "psych", "jsonlite"), repos="https://cloud.r-project.org")'

   `ggplot2` is included in `tidyverse`.

4. melody-features
Feature extraction is run using the Python package defaults. Some inputs may be skipped (e.g. unsupported or polyphonic). This is expected behaviour.
Two files list melody basenames (no paths):

| File | Role |
|---|---|
| `usable_china.txt` | One basename per line → pool for China. |
| `usable_europa.txt` | One basename per line → pool for Europe. |

logistic.py uses every China basename and draws a random subset of Europe (random.seed(42), at most 2200 melodies).
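
The selection rule described above can be sketched as follows. This is a minimal illustration, not the code in `logistic.py`: the helper names `read_basenames` and `select_melodies` are hypothetical, and using `random.sample` on a local generator is an assumption, so the subset it draws will not necessarily match the one the script draws.

```python
import random

EUROPE_CAP = 2200  # "at most 2200 melodies", per the description above


def read_basenames(path):
    """Return the non-empty lines of a basename list, stripped of whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def select_melodies(china, europe, cap=EUROPE_CAP, seed=42):
    """Keep every China basename; draw a seeded random subset of Europe."""
    rng = random.Random(seed)  # local generator (assumption; the script seeds the global one)
    if len(europe) <= cap:
        return list(china), list(europe)  # nothing to subsample
    return list(china), rng.sample(europe, cap)
```

Calling `select_melodies(read_basenames("usable_china.txt"), read_basenames("usable_europa.txt"))` twice returns the same two lists, which is the property the fixed seed is there to provide.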
With .venv activated:

    python logistic.py
    python xgbclassifer.py
    Rscript factor_logistic.R
    python factor_logistic_plot_confusion.py
    python comparison.py

| Step | Command | What it does |
|---|---|---|
| 1 | `python logistic.py` | Builds `essen_china_europe_features.csv` on first run (this can take a long time due to IDyOM runs). Same stratified train/test and CV as other scripts. Writes Figure 1 (`confusion_matrix.pdf`) plus coefficient and permutation-importance artefacts. |
| 2 | `python xgbclassifer.py` | Needs the features CSV. Same split/features as 1. Writes Figure 2 (`xgb_confusion_matrix_test.pdf`). |
| 3 | `Rscript factor_logistic.R` | EFA on the same numeric features (9 factors, promax, parallel analysis). Writes Figure 3 (`factor_eigenvalues_elbow.pdf`), factor GLM output, and CSVs consumed by 4. |
| 4 | `python factor_logistic_plot_confusion.py` | Reads R’s prediction CSVs. Writes Figure 4 (`factor_logistic_confusion_matrix_test.pdf`). |
| 5 | `python comparison.py` | Needs the features CSV from 1. Builds or loads `source_to_csv_columns_with_novel.json`, trains one logistic model per implementation source plus an all features baseline, writes comparison CSV/TeX/PDF and `coefficients/*.csv`. |

First run of Step 1 can take a long time. Later runs load essen_china_europe_features.csv and skip re-extraction unless you delete that file.
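
The caching behaviour can be pictured as a load-or-build pattern. The sketch below is an assumption about the general shape, not the repository's code: `load_or_build` and the `build` callable are hypothetical names, and the real script works with parsed feature tables rather than raw text.

```python
from pathlib import Path

CACHE = Path("essen_china_europe_features.csv")


def load_or_build(cache_path, build):
    """Return the cached CSV text, calling build() only when the file is absent."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return cache_path.read_text(encoding="utf-8")  # fast path: later runs
    text = build()  # slow path: first run, where feature extraction happens
    cache_path.write_text(text, encoding="utf-8")
    return text
```

Deleting the file is the only way to reach the slow path again in this sketch, which mirrors the "delete that file" instruction above.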

| Figure | Output file | Produced by |
|---|---|---|
| 1 | `confusion_matrix.pdf` | `python logistic.py` |
| 2 | `xgb_confusion_matrix_test.pdf` | `python xgbclassifer.py` |
| 3 | `factor_eigenvalues_elbow.pdf` | `Rscript factor_logistic.R` |
| 4 | `factor_logistic_confusion_matrix_test.pdf` | `python factor_logistic_plot_confusion.py` (after 3) |

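
Figures 1, 2 and 4 are confusion matrices. As a reminder of what those figures tabulate, here is a small stand-alone sketch that counts true versus predicted labels. It is illustrative only: the function name and the label strings `"China"` and `"Europe"` are assumptions, and the scripts themselves may build the matrices with a library routine.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels=("China", "Europe")):
    """Rows are true labels, columns are predicted labels, in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Share of predictions on the diagonal (correctly classified melodies)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0
```

With this layout, the top-left cell counts China melodies predicted as China and the bottom-left cell counts Europe melodies predicted as China.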
Rscript factor_logistic.R also writes a self-contained 3D interactive visualization of the eight-factor solution to docs/:

| File | Role |
|---|---|
| `docs/index.html` | Three.js / 3d-force-graph viewer. |
| `docs/network_data.js` | Nodes (factors + variables with \|loading\| > 0.3) and links. |
| `docs/network_data.json` | Same data as a portable JSON sidecar. |

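
To make the node and link rule concrete, the following sketch builds a graph from a loadings table using the |loading| > 0.3 threshold. It is written in Python for illustration even though the real files are produced by `factor_logistic.R`. The top-level `nodes`/`links` keys and the `id`, `source`, `target` fields follow the graph-data shape that 3d-force-graph accepts; the `kind` and `loading` fields, the function name, and the example variable names in the usage note are hypothetical.

```python
import json

THRESHOLD = 0.3  # keep a variable-factor link only when |loading| > 0.3


def build_network(loadings, threshold=THRESHOLD):
    """loadings maps variable name -> {factor name: loading}."""
    factors = sorted({f for row in loadings.values() for f in row})
    nodes = [{"id": f, "kind": "factor"} for f in factors]
    links = []
    for variable, row in loadings.items():
        strong = {f: v for f, v in row.items() if abs(v) > threshold}
        if not strong:
            continue  # variables with no strong loading are left out
        nodes.append({"id": variable, "kind": "variable"})
        for factor, value in strong.items():
            links.append({"source": variable, "target": factor, "loading": value})
    return {"nodes": nodes, "links": links}


def to_json(network):
    """Serialise the graph the way a JSON sidecar file would store it."""
    return json.dumps(network, indent=2)
```

For example, `build_network({"pitch_range": {"F1": 0.62, "F2": -0.10}})` yields two factor nodes, one variable node, and a single link from `pitch_range` to `F1`.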
python build_melody_examples.py populates docs/melody_examples/ with a piano-roll PNG and a synthesized WAV for the 3 highest and 3 lowest-scoring melodies for every feature node and every factor node in the network. Clicking any node then displays these examples.
- Features are ranked by their value in `essen_china_europe_features.csv`.
- Factors are ranked by the regression factor scores in `factor_scores_for_logreg.csv` (produced by `factor_logistic.R`).
- All of this is precomputed to make the webapp performant.

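
The ranking step amounts to sorting melodies by one column and taking both ends. A minimal sketch, assuming the scores for a single feature or factor have already been read into a dictionary keyed by melody name (the function name `extremes` is hypothetical, and `build_melody_examples.py` may handle ties or missing values differently):

```python
def extremes(scores, k=3):
    """Return (highest k, lowest k) melody names for one feature or factor.

    `scores` maps melody name -> numeric value. The lowest list starts
    with the single lowest-scoring melody.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], ranked[-k:][::-1]
```

Running this once per node ahead of time means the viewer only has to look up six precomputed examples when a node is clicked.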
- Random seeds: `42` is fixed in the Python scripts (`train_test_split`, CV folds, Europe subsample, XGBoost, etc.) and in `factor_logistic.R` (`set.seed(42)`).
- Same train/test rows across `logistic.py`, `comparison.py`, and `xgbclassifer.py`: keep `test_size=0.2` and seeds unchanged.
- Invalidate the feature cache: delete `essen_china_europe_features.csv` to force re-extraction (e.g. after changing `usable_*.txt` or upgrading `melody-features` in a way that affects columns).
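
Why a fixed seed and a fixed `test_size` give every script the same rows can be seen in a small pure-Python stand-in for a stratified split. This is not scikit-learn's `train_test_split` and will not reproduce the repository's actual partition; the function name and the rounding rule are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), holding out about test_size of each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):  # fixed class order keeps the random stream stable
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Given the same labels in the same order, two calls with identical `seed` and `test_size` return identical index lists. Changing either argument, or reordering the input rows, changes the partition, which is why the note above asks for those settings to stay untouched.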