Exploratory Analysis of 200K+ song lyrics from the 1 million songs dataset

edublancas/song-lyrics

Song lyrics project

Exploratory Data Analysis and Visualization, Columbia University, Spring 2018.

Project overview

We explored song lyrics from the musiXmatch dataset, part of the Million Song Dataset, to identify trends in lyrics and music across time and geography. We posed questions covering different facets of the dataset and identified some interesting trends.

Deliverables

The report for this project is available here.

The interactive component, built in D3, lets you explore sentiment scores, topic scores and similar artists for the top artists in the Million Song Dataset + musiXmatch data. Click here to view the interactive component.

Folder structure

  • data/ - Data is dumped here, not included in the repository
  • interactive/ - Source code for interactive component
  • experiments/ - Notebooks/scripts that we used to explore the data
  • lib/ - R utility functions used in the project
  • process/ - Scripts for downloading and processing the data (Python 3)
    • process/pkg/ - Python package with utility functions
    • process/clean/ - Cleaning the raw data
    • process/transform/ - Code for generating various song vector representations
    • process/cluster/ - Clustering songs
  • report/ - Report files

Data

We are using the Million Song Dataset, specifically the musiXmatch dataset, which contains lyrics data for 237,662 tracks.
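For orientation, here is a minimal sketch of how one might read the musiXmatch bag-of-words files. It assumes the publicly documented layout: a line starting with `%` lists the vocabulary, and each track line has the form `track_id,mxm_track_id,index:count,...` with word indices starting at 1. The function names are hypothetical and are not taken from this repository's `process/` code.

```python
from typing import Dict, List, Tuple


def parse_vocabulary(line: str) -> List[str]:
    """Parse the '%'-prefixed line that lists the vocabulary words."""
    return line.lstrip("%").strip().split(",")


def parse_track(line: str) -> Tuple[str, str, Dict[int, int]]:
    """Parse one track line into (track_id, mxm_id, {word_index: count})."""
    track_id, mxm_id, *pairs = line.strip().split(",")
    counts: Dict[int, int] = {}
    for pair in pairs:
        index, count = pair.split(":")
        counts[int(index)] = int(count)
    return track_id, mxm_id, counts


def to_word_counts(counts: Dict[int, int], vocabulary: List[str]) -> Dict[str, int]:
    """Map word indices to words (assumption: indices are 1-based)."""
    return {vocabulary[index - 1]: count for index, count in counts.items()}


# Illustrative input only; these lines are made up, not real dataset rows.
vocabulary = parse_vocabulary("%i,the,you\n")
track_id, mxm_id, counts = parse_track("TRACK0001,12345,1:6,3:2\n")
word_counts = to_word_counts(counts, vocabulary)
```

Comment lines in the raw files (assumed here to start with `#`) would need to be skipped before calling these helpers.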

Quickstart

git clone https://github.com/edublancas/song-lyrics
cd song-lyrics

0. Software requirements

This project requires Python 3 and R.

To install the required Python and R packages:

make requirements

1. Get raw data

The following command fetches all the datasets we used. It creates a new data/ folder in the current working directory; raw data is stored in data/raw.

make get_data

Note: GloVe can cause problems when downloaded with wget. It is better to download it manually and put the uncompressed data in data/raw.
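After a manual download, it can help to confirm that the uncompressed GloVe file is readable before running the processing step. The sketch below assumes the standard GloVe text format (one word per line followed by space-separated vector components). The function name `read_glove` and the file name in the usage comment are hypothetical; use whichever GloVe file the project's scripts expect.

```python
from typing import Dict, Iterable, List


def read_glove(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Read GloVe text lines into a {word: vector} dictionary."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def same_dimension(vectors: Dict[str, List[float]]) -> bool:
    """Return True when every vector has the same number of components."""
    return len({len(vector) for vector in vectors.values()}) <= 1


# Usage (hypothetical file name, adjust to the file you downloaded):
# with open("data/raw/glove.6B.50d.txt", encoding="utf-8") as f:
#     vectors = read_glove(f)
#     print(len(vectors), same_dimension(vectors))
```

Loading the whole file into a dictionary is fine for a quick sanity check, though larger GloVe files may need a streaming approach.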

2. Process data

This command runs all the cleaning and processing steps we applied to the data and outputs the final datasets used in the report and the interactive component.

make bootstrap

3. Build report

Build the final report.

make report
