# COGS 108 - Project Proposal

# Names

- Arushi Munjal
- Siya Randhawa
- Zoya Hasan
- Shruti Yamala

# Research Question

- **Research Question**: How do specific lyrical themes correlate with the Spotify popularity scores of English-language songs across major genres (such as pop, rock, hip-hop, and country) released from 2013 to 2023? Do certain lyrical themes consistently correlate with high popularity scores across multiple genres, or are they associated with popularity mainly within individual genres?

## Background and Prior Work

Streaming platforms like Spotify have reshaped the music landscape, changing not only how listeners access music but also how data about listener preferences, song popularity, and music trends is gathered and analyzed. Knowing what makes a song popular can offer valuable insights for artists and producers aiming to create music that resonates widely. Among the factors that influence a song’s popularity—like the artist’s profile and musical structure—lyrics stand out as a direct way to express emotions and themes that connect with listeners. This project will explore how specific lyrical themes align with popularity across different genres, spotlighting trends that strike a chord with Spotify audiences.

Previous research has delved into the role of lyrical sentiment in predicting popularity and classifying genres. For instance, a study published in *Ultimatics: Jurnal Teknik Informatika* used a BERT-based model to analyze English song lyrics, predicting song popularity through sentiment analysis with an 87% accuracy rate, employing techniques like oversampling and data preprocessing. This study found that lyrical sentiment—whether positive or negative—strongly influenced popularity, with BERT capturing lyrical emotions linked to audience engagement and song performance.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Another approach was undertaken by Boonyanit and Dahl at Stanford, who aimed to classify songs into genres based solely on lyrical content, using GloVe embeddings and LSTM models to predict genre with an accuracy of 68% at its peak. Their work illuminated the capacity of lyrics to signal genre-related characteristics, especially in distinguishing unique words and recurring themes. By focusing on genre classification, this study underscored how lyrical content often aligns with genre conventions, revealing differences in word choice and thematic style across genres like hip-hop, pop, and rock, despite overlaps.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

Building on these studies, our project diverges by focusing not on predicting popularity or genre independently but on understanding how specific themes within lyrics correlate with popularity across genres like pop, hip-hop, rock, and country. We are not merely classifying songs by sentiment or genre; rather, we are examining genre as a contextual factor in lyrical themes. This will allow us to determine which themes—such as love, nostalgia, or resilience—drive higher engagement in particular genres, offering insights into the preferences of genre-specific audiences. Our findings can aid musicians in tailoring lyrics to align with listener tastes, leveraging data to enhance song impact on streaming platforms.

# Hypothesis


We predict that lyrical themes such as heartbreak and love will correlate more strongly with popularity in genres such as Pop and Indie, while themes such as empowerment will correlate more strongly with popularity in genres such as Rap. This hypothesis rests on our general familiarity with popular songs in these genres: well-known Pop and Indie songs often center on love and heartbreak, while well-known Rap songs often center on empowerment.

# Data

- **Ideal Dataset**

    - Variables: Song Information (title, artist, release date, genre), Lyrical Content, Popularity Metrics (Spotify popularity score, chart position, streaming counts), Temporal and Geographic Data (year, region)
    - Observations: At least 100,000 songs from 2013-2023 across genres to ensure variety and relevance
    - Collection Method: Song Metadata (Spotify API for genre, popularity score, release dates), Lyrics (Genius API or Musixmatch), Popularity and Chart Data (Billboard for mainstream popularity metrics)
    - Storage: CSV files, kept locally or in a GitHub repo, for already structured data and data pulled from APIs

- **Real Dataset**

    - Variables: The dataset will include song metadata such as Title, Artist, Release Date, Genre, Lyrics, and Popularity Metrics (Spotify Popularity Score, Chart Position, Streaming Counts), along with Temporal and Geographic Data (Year, Region) to contextualize each song's release and popularity trajectory.
    - Observations: The dataset will comprise at least 100,000 songs spanning multiple genres (pop, rock, hip-hop, country, and more) from 2013 to 2023. This timeframe keeps the data both diverse and relevant to current listening trends.
    - Collection Method: Song metadata, including genre, release dates, and Spotify popularity scores, will be sourced via the Spotify API, while lyrics will be collected using the Genius or Musixmatch API. Popularity and chart data will be integrated from Billboard to reflect mainstream success metrics.
    - Storage: Data will be stored in CSV files, either locally or in a GitHub repository, depending on the structure of the data returned by the APIs.

- **Potential Datasets**

    - Audio features and lyrics of Spotify songs: [Kaggle Link](https://www.kaggle.com/datasets/imuhammad/audio-features-and-lyrics-of-spotify-songs)
    - Musixmatch Dataset: [Kaggle Link](https://www.kaggle.com/datasets/abhishek/musicxmatch-dataset)
    - Music Dataset: Song Information and Lyrics: [Kaggle Link](https://www.kaggle.com/datasets/suraj520/music-dataset-song-information-and-lyrics)
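
Whichever dataset we end up using will need to be narrowed to the scope of the research question: English-language songs in the target genres released from 2013 to 2023. The following is a minimal sketch of that filtering step using only the Python standard library. The toy CSV and all column names (`title`, `genre`, `release_date`, `language`, `popularity`) are assumptions for illustration and may not match the actual files.

```python
import csv
import io

# Toy CSV standing in for a downloaded dataset.
# All column names and values here are illustrative assumptions.
SAMPLE_CSV = """title,genre,release_date,language,popularity
Song A,pop,2015-06-01,en,80
Song B,rock,2010-03-12,en,55
Song C,latin,2019-09-30,es,90
Song D,country,2023-01-15,en,40
Song E,hip-hop,2013,en,70
"""

TARGET_GENRES = {"pop", "rock", "hip-hop", "country"}
FIRST_YEAR, LAST_YEAR = 2013, 2023


def release_year(release_date: str) -> int:
    """Return the year from a date string that starts with YYYY."""
    return int(release_date[:4])


def filter_songs(rows):
    """Keep English songs in the target genres released in the study window."""
    kept = []
    for row in rows:
        year = release_year(row["release_date"])
        if (
            row["language"] == "en"
            and row["genre"] in TARGET_GENRES
            and FIRST_YEAR <= year <= LAST_YEAR
        ):
            # Store the parsed year and a numeric popularity score.
            kept.append({**row, "year": year, "popularity": int(row["popularity"])})
    return kept


songs = filter_songs(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print([s["title"] for s in songs])
```

Reading only the first four characters of the release date lets the sketch accept both full dates and year-only values, which is an assumption about how release dates might be recorded.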

To answer the proposed research questions, we will use these datasets to explore correlations between lyrical themes and song popularity across different genres. The ideal dataset would provide comprehensive variables that allow a wide-ranging analysis of song features and their influence on popularity. With the real dataset's structured information from APIs, we can apply machine learning techniques to extract and categorize lyrical themes, enabling a closer look at which types of content align with audience preferences. Together, these datasets will allow us to uncover lyrical themes that correlate with higher popularity scores and to analyze how those patterns differ by genre and over time. This approach should provide insight into genre-specific listener preferences and into how lyrical content relates to song success across years.

# Ethics & Privacy

As we examined the datasets, we identified a number of ethical issues that are central to the data science process, especially with respect to privacy and bias. These apply to the datasets we plan to use, such as the Audio features and lyrics of Spotify songs dataset. One of the main issues is possible bias in representation. The Spotify songs dataset, for instance, has over 18,000 tracks, but it primarily features well-known songs and mainstream performers, possibly overlooking the variety of musical genres and up-and-coming musicians. This bias may lead to inaccurate inferences about the themes and emotional components that are common in popular music lyrics.

The terms of use for each dataset should be taken into account when collecting data. Because the lyrics in the dataset are protected by intellectual property laws, copyright regulations must be followed, and use of the lyrics is restricted to scholarly and non-commercial purposes. We will make sure that our analysis complies with these restrictions and does not improperly reveal copyrighted content. We plan to carry out exploratory data analysis at several levels in order to address the biases we have identified and to make sure that hidden factors that could affect our findings are carefully taken into account. To find any notable biases, we will first look at the datasets' demographic representation. To make sure our results do not unduly favor popular music, we will analyze whether characteristics such as lyrical sentiment correspond with popularity across a variety of genres.

To ensure responsible data analysis, we will address a number of ethical issues while using the Audio Features and Lyrics of Spotify Songs dataset. We will follow fair use rules by concentrating on broad thematic trends rather than directly quoting copyrighted lyrics. Furthermore, our study respects the dataset's original use and is solely academic and non-commercial. Because lyrics may contain sensitive or personal material, we will examine general trends across genres rather than assigning particular themes to specific performers. This method preserves the integrity of the analysis without disclosing private information. Even though the lyrics are openly accessible, we will handle any personally identifiable information responsibly to prevent invasive or speculative conclusions about the artists based solely on their lyrics. By looking at a range of genres and acknowledging the dataset's constraints, we aim to mitigate potential genre biases despite the dataset's emphasis on popular music. Our objective is to advance an inclusive perspective on musical trends and themes.

An additional ethical consideration is the equitable impact of our findings. We need to be careful about how our analysis might affect how people view certain musical genres or performers, especially if our results imply that some genres are more popular than others. We aim to promote a more inclusive understanding of music data and its implications by being conscious of potential biases and discussing them openly. The goal of our project is to determine whether there is a connection between the popularity of songs and their lyrical themes or messages. Our ultimate objective is to uphold ethical standards in our study, advancing our knowledge of the relationship between lyrical themes and music popularity while honoring the rights and talents of all artists involved.

# Team Expectations

Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

- **Team Expectation 1: Structure of Meetings and Communication**
    - Weekly meetings will be held on Tuesdays at 6 PM on Zoom, and we will communicate by iMessage.

- **Team Expectation 2: Role Assignment and Task Distribution**
    - With frequent updates in a shared Google Spreadsheet, tasks will be assigned according to individual strengths. If a team member is having trouble with a task, they should notify the others at least three days before the due date so that any required modifications can be made.

- **Team Expectation 3: Resolving Conflicts and Making Decisions**
    - Major choices will need unanimous approval, but minor ones will be decided by a majority vote. The member with the most appropriate experience will be given preference for each specialized task, and any outstanding conflicts will be settled by week seven if needed.

- **Team Expectation 4: Accountability and Progress Monitoring**
    - Each group member will report their task status in the shared Google Spreadsheet at least once every week to ensure steady progress. Members will provide a quick update on their completed work and any obstacles they are encountering during each weekly meeting.

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 10/29        | 1 PM         | Review COGS108 expectations and brainstorm initial project ideas on music analysis | Decide on communication methods, finalize the research question on lyrical themes and popularity scores, and outline initial themes to analyze (e.g., love, empowerment, heartbreak). |
| 11/5         | 10 AM        | Conduct background research on available datasets and methods for lyrical analysis | Discuss potential datasets (e.g., Genius for lyrics, Spotify API for popularity scores), consider ethical implications, and draft the project proposal. |
| 11/13        | 10 AM        | Finalize and submit the project proposal; identify relevant datasets covering 2013-2023 | Plan data wrangling steps for extracting lyrical themes and popularity scores, assign group roles, and set responsibilities for specific tasks. |
| 11/19        | 6 PM         | Import and preprocess data for lyrical themes and popularity scores from Spotify and other sources | Review data wrangling progress and discuss exploratory data analysis (EDA) to understand genre and theme distributions. |
| 11/27        | 12 PM        | Complete data wrangling and initial EDA; begin analyzing lyrical themes and popularity across genres | Review preliminary analysis results, make adjustments as needed, and perform a mid-project check-in. |
| 12/3         | 12 PM        | Complete final analysis; draft results, conclusions, and discussion sections | Review and edit the full project, finalizing insights on the relationship between lyrical themes and song popularity across genres from 2013 to 2023. |
| 12/11        | Before 11:59 PM | N/A | Submit the final project and complete group project surveys. |

## Footnotes

1. <a name="cite_note-1"></a> Agatha, H., Putri, F. P., & Suryadibrata, A. (2023). Sentiment Analysis on Song Lyrics for Song Popularity Prediction Using BERT Algorithm. *Ultimatics: Jurnal Teknik Informatika*. Retrieved from [https://ejournals.umn.ac.id/index.php/TI/article/view/3420/1509](https://ejournals.umn.ac.id/index.php/TI/article/view/3420/1509)
2. <a name="cite_note-2"></a> Boonyanit, A., & Dahl, A. (2023). Music Genre Classification using Song Lyrics. *Stanford CS224N Custom Project*. Retrieved from [https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1214/reports/final_reports/report003.pdf](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1214/reports/final_reports/report003.pdf)