# Project Part 1: Predicting Song Popularity Score

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/cdinh92/CS39AA-project/blob/main/project_part1.ipynb)

Welcome to the data science project for the CS39AA NLP Machine Learning class at MSU Denver. This project explores the music industry and investigates whether a predictive model can forecast a song's success, as measured by its popularity score. The analysis focuses on 8 key song features: danceability, energy, mode, loudness, speechiness, instrumentalness, tempo, and valence.

## 1. Introduction/Background

In this project, I will use a dataset of more than 30,000 songs collected from the Spotify API and published by Joakim Arvidsson on [Kaggle](https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs). It is a 23-column dataset with 32,833 popular songs, mostly from the 2010s to 2020. I will use only the 8 features mentioned above, taken from the 8 corresponding columns of the dataset, together with the popularity score as the target, to train the model and test the accuracy of its predictions.

**8 Key Feature Definitions:**
_(Feature names match the column names in the dataset. The prediction target, **track_popularity**, is listed first.)_

| Feature | Description |
| --- | ----------- |
| **track_popularity** | Song popularity (0-100), where a higher value means a more popular song. |
| mode | Indicates the modality (major or minor) of a track. Major is represented by 1 and minor by 0. |
| danceability | Describes how suitable a track is for dancing. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | Measured on a scale from 0.0 to 1.0. Energetic tracks feel fast, loud, and noisy. |
| loudness | The overall loudness of a track in decibels (dB), typically between -60 and 0 dB. |
| speechiness | Detects the presence of spoken words in a track. Values above 0.66 describe tracks that are probably made entirely of spoken words. |
| instrumentalness | Values above 0.5 are intended to represent instrumental tracks. |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). |
| valence | Measured on a scale from 0.0 to 1.0. The higher the value, the more positive the mood of the song. |
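The table above maps directly onto a feature matrix and a target vector. The following is a minimal sketch of that split, assuming the column names listed in the table. The helper `split_features_target` and the two-row sample frame are hypothetical and only illustrate the shape of the data; the sample values are made up, not taken from the dataset.

```python
import pandas as pd

# Column names as listed in the feature table.
FEATURES = [
    "danceability", "energy", "mode", "loudness",
    "speechiness", "instrumentalness", "tempo", "valence",
]
TARGET = "track_popularity"


def split_features_target(df: pd.DataFrame):
    """Return (X, y): the eight feature columns and the popularity target."""
    missing = [col for col in FEATURES + [TARGET] if col not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return df[FEATURES], df[TARGET]


# Hypothetical two-row sample; the values are illustrative only.
sample = pd.DataFrame({
    "track_name": ["song a", "song b"],
    "track_popularity": [66, 12],
    "danceability": [0.75, 0.40],
    "energy": [0.90, 0.35],
    "mode": [1, 0],
    "loudness": [-4.5, -11.2],
    "speechiness": [0.05, 0.03],
    "instrumentalness": [0.0, 0.6],
    "tempo": [122.0, 98.0],
    "valence": [0.52, 0.21],
})

X, y = split_features_target(sample)
print(X.shape)  # two rows, eight feature columns
```

Raising a `KeyError` for absent columns is a design choice in this sketch: it surfaces a renamed or missing column early instead of silently training on fewer features.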

**Initial Prediction:**
_Anticipating the factors that contribute to a song's popularity is a complex task. My guess is that danceability, energy, tempo, and particularly valence will emerge as key components influencing a song's popularity score. However, it is crucial to acknowledge the broader context of the music industry, where an artist's name recognition and strategic marketing by big companies often have a much greater influence. These industry dynamics mean that independent artists may face unique challenges in breaking into mainstream charts._

**Alternative Approach:**
_Because a predictive model may not capture everything that makes a song successful, this project may also seek to identify common features among trending songs, in case the predictions fall far short of expectations._
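One way this alternative approach could look is a comparison of average feature values between popular songs and the rest. The sketch below assumes a popularity cutoff of 70, which is an arbitrary illustrative choice and not a value from this project. The helper `compare_feature_means` and the four-row sample are likewise hypothetical.

```python
import pandas as pd


def compare_feature_means(df, features, target="track_popularity", threshold=70):
    """Compare mean feature values of songs at or above `threshold` with the rest."""
    popular = df.loc[df[target] >= threshold, features].mean()
    others = df.loc[df[target] < threshold, features].mean()
    return pd.DataFrame({
        "popular": popular,
        "others": others,
        "difference": popular - others,
    })


# Hypothetical sample; the values are made up for illustration.
songs = pd.DataFrame({
    "track_popularity": [90, 80, 30, 20],
    "danceability": [0.8, 0.6, 0.4, 0.2],
    "valence": [0.9, 0.7, 0.3, 0.5],
})

summary = compare_feature_means(songs, ["danceability", "valence"])
print(summary)
```

A large value in the `difference` column would suggest a feature that trending songs tend to share, although a difference in means alone does not show that the feature causes popularity.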

**About the Predictive Model:**

_The model follows the concept of **regression**, in which a model learns the relationship between a set of input features and a numeric outcome. In this case, the dataset consists of input features (valence, mode, energy, ...) and corresponding output targets (the actual popularity scores of the songs). The goal is to train the model to make accurate predictions for new inputs._
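To make the idea of regression concrete, here is a minimal sketch using ordinary least squares with NumPy. It is not necessarily the model this project ends up using, since the text does not name a specific algorithm. The data is synthetic: two made-up features and a target generated from known weights, so the fit can be checked against them. The function names `fit_linear_regression` and `predict` are hypothetical.

```python
import numpy as np

# Synthetic data: two made-up features in [0, 1] and a noisy linear target.
rng = np.random.default_rng(0)
n_songs = 200
X = rng.uniform(0.0, 1.0, size=(n_songs, 2))
true_weights = np.array([30.0, 10.0])
true_intercept = 20.0
y = X @ true_weights + true_intercept + rng.normal(0.0, 0.5, size=n_songs)


def fit_linear_regression(X, y):
    """Fit y ~ intercept + X @ weights by ordinary least squares."""
    # Prepend a column of ones so the first coefficient acts as the intercept.
    X_design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return coef[0], coef[1:]


def predict(X, intercept, weights):
    """Apply the fitted linear model to new inputs."""
    return intercept + X @ weights


intercept, weights = fit_linear_regression(X, y)
new_song = np.array([[0.5, 0.5]])  # hypothetical unseen input
print(predict(new_song, intercept, weights))
```

Because the target here was generated from a known linear rule, the recovered weights should land close to the true ones. Real popularity scores are unlikely to be this well behaved.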

_The model will then be used to predict the popularity scores of songs in another dataset: Spotify top 50 songs in 2021 on [Kaggle](https://www.kaggle.com/datasets/equinxx/spotify-top-50-songs-in-2021). This will be the primary test of how accurately the model performs on more recent music tastes._
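Checking accuracy on a second dataset comes down to comparing predicted scores with actual ones. The sketch below shows two common error measures for regression, mean absolute error (MAE) and root mean squared error (RMSE), written in plain Python. The choice of these two metrics is an assumption, and the score lists are made-up numbers, not results from either dataset.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted scores."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalizes large misses more."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical popularity scores for three songs (illustrative only).
actual_scores = [80, 90, 70]
predicted_scores = [75, 95, 70]

mae = mean_absolute_error(actual_scores, predicted_scores)
rmse = root_mean_squared_error(actual_scores, predicted_scores)
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```

Both metrics are in the same units as the popularity score, so an MAE of 5 would mean the predictions miss by about 5 popularity points on average.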

## 2. Exploratory Data Analysis

Let's explore the dataset and see whether we can trim it by eliminating irrelevant columns.

```python
# Import all of the Python modules/packages needed here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read the CSV file and print a summary of its columns
df = pd.read_csv('/kaggle/input/30000-spotify-songs/spotify_songs.csv')
df.info()
```

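Once the data is loaded, a first pass at deciding which columns matter could look like the sketch below. It keeps only the numeric columns, counts missing values, and ranks features by their correlation with `track_popularity`. The small frame stands in for the real dataset so the example runs on its own. Its values are made up, and `track_name` is included only as an example of a non-numeric column.

```python
import pandas as pd

TARGET = "track_popularity"

# Hypothetical stand-in for the loaded dataset (illustrative values only).
songs = pd.DataFrame({
    "track_name": ["a", "b", "c", "d"],
    "track_popularity": [20, 40, 60, 80],
    "danceability": [0.2, 0.4, 0.6, 0.8],
    "energy": [0.5, None, 0.7, 0.9],
})

# Keep numeric columns only; text columns cannot feed a regression directly.
numeric = songs.select_dtypes(include="number")

# Count missing values per column to see where cleaning is needed.
missing_counts = numeric.isna().sum()

# Correlation of each feature with the target, strongest positive first.
correlations = numeric.corr()[TARGET].drop(TARGET).sort_values(ascending=False)

print(missing_counts)
print(correlations)
```

Correlation only captures linear relationships, so a feature with a low value here may still be useful to a model. Treat the ranking as a starting point for exploration.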