# Exploratory Data Analysis of Music Trends on Spotify

## 1 Aim
The aim of this independent research project is to conduct an exploratory data analysis of music trends using a dataset obtained from Spotify. The project aims to gain insights into the popularity, characteristics, and patterns of music on the platform, and identify factors that may influence music trends.

## 2 Objectives
1. **Identify Key Audio Features**
    Analyze the dataset to identify the key audio features such as danceability, energy, loudness, and valence that play a significant role in defining the characteristics of tracks.
    
2. **Explore Popularity Trends**
    Investigate the popularity trends within the dataset to understand how popularity varies across different tracks. Analyze the distribution of track popularity and identify any underlying patterns or relationships with other variables.
    
3. **Examine Relationships Between Variables**
    Explore the relationships between different variables such as danceability, energy, and acousticness to identify any correlations or associations that may exist. Investigate how these variables interact with each other and their potential impact on track popularity.
    
4. **Analyze Track Durations**
    Examine the durations of tracks and identify any trends or patterns. Investigate whether track duration has any relationship with popularity or other audio features.
    
5. **Visualize Insights**
    Utilize data visualization techniques to effectively communicate the findings and insights derived from the analysis. Generate visual representations such as histograms, scatter plots, and bar charts to illustrate the relationships and trends discovered in the dataset.
    
6. **Provide Interpretation and Recommendations**
    Interpret the results and findings from the data analysis, drawing meaningful conclusions about the music trends on Spotify. Based on these insights, provide recommendations or suggestions for artists, music industry professionals, or researchers in understanding and leveraging the observed trends.

By achieving these objectives through an exploratory data analysis of the provided Spotify dataset, this research project aims to enhance our understanding of music trends and provide valuable insights into the dynamics of music consumption on the platform.

## 3 Data Relevance and Justification
The chosen dataset obtained from Spotify is relevant to the project brief and aligns with the listed topics of exploring music trends and conducting an exploratory data analysis. Here are the justifications for the data source:

### 3.1 Origin and Acquisition
The dataset used in this research project was obtained from the Spotify Web API, an official API provided by Spotify. The Spotify Web API grants developers access to a vast array of music-related data, including information about tracks, albums, artists, and audio features.

To obtain the dataset, the Spotify Web API was queried from Python using Spotipy, a library designed for interacting with the Spotify Web API. Through this library, authorized requests were made to the API, in compliance with Spotify's terms of service and API usage guidelines.

```
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

cid = "YOUR_CLIENT_ID"
secret = "YOUR_CLIENT_SECRET"

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
```



The API requests were crafted to retrieve the necessary data fields, including artist_name, track_name, popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms, and time_signature. These attributes were selected as they are directly relevant to the research question and objectives, allowing for a comprehensive analysis of music trends on Spotify.

Because the data comes directly from the official Spotify Web API, it can be treated as an authoritative record of what Spotify exposes for each track, which supports the integrity of the analysis. Access followed Spotify's terms of service and data access policies, so the data was used ethically and can support meaningful insights into music trends on the platform.

### 3.2 Appropriateness for Research Question
The dataset obtained from the Spotify Web API is highly suitable for the research question posed, as it provides relevant information for analyzing music trends on the platform. Here's an overview of why this dataset is appropriate: 

#### 3.2.1 Comprehensive Music Information:
The dataset includes artist and track information, allowing for the identification and analysis of specific songs and musicians. This comprehensive music data provides a foundation for exploring music trends and their popularity on Spotify.

#### 3.2.2 Popularity Metrics:
The dataset incorporates Spotify's popularity score, a value from 0 to 100 that quantifies the relative popularity of each track within the Spotify ecosystem. According to Spotify's documentation, the score is calculated algorithmically and is based mostly on the total number of plays a track has had and how recent those plays are. This enables researchers to assess the popularity and prominence of different tracks.

#### 3.2.3 Audio Features:
The dataset includes essential audio features such as danceability, energy, and tempo. These features provide insights into the characteristics of tracks and their potential influence on popularity. Analyzing these audio features can reveal patterns and correlations between specific musical attributes and the success of tracks.

#### 3.2.4 Detailed Music Characteristics:
The dataset offers information on various aspects of music characteristics, such as key, loudness, and mode. These details provide a deeper understanding of the musical elements present in the tracks and allow researchers to explore how these characteristics relate to popularity and genre classification.

By encompassing comprehensive music information, popularity metrics, audio features, and detailed music characteristics, the dataset obtained from the Spotify Web API provides a solid foundation for conducting an exploratory data analysis of music trends on the platform.

### 3.3 Format and Suitability:
The Spotify Web API returns its responses as JSON. For this project, the retrieved records were stored in CSV (Comma-Separated Values) format, which is widely recognized and compatible with various data analysis tools and libraries. This format is well-suited for data analysis tasks, including the exploration of music trends on Spotify. Here's a more detailed explanation:

#### 3.3.1 CSV Format:
The dataset is structured in a CSV format, where each row represents a specific track and each column contains the corresponding data attributes. CSV files are plain text files that use commas to separate values, making them highly readable and accessible. This format allows for easy sharing, manipulation, and analysis of the dataset.

#### 3.3.2 Loading into a DataFrame:
Python libraries such as pandas provide functionality to load CSV files into a DataFrame, a tabular data structure that offers flexibility and powerful data manipulation capabilities. By using the pandas library, the CSV data can be imported, organized, and processed in a structured manner. This enables researchers to perform a wide range of numerical and statistical analyses on the dataset.
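
As a minimal sketch of this loading step, the snippet below reads CSV text into a DataFrame with `pandas.read_csv`. The inline sample rows and the helper name `load_tracks` are made up for illustration; in the project the argument would be the path of the saved CSV file, whose name is not stated here.

```python
import io

import pandas as pd

# Hypothetical sample in the same column layout as the dataset (values are invented).
SAMPLE_CSV = """artist_name,track_name,popularity,danceability,energy,duration_ms
Artist A,Song One,72,0.81,0.65,201000
Artist B,Song Two,55,0.44,0.90,185500
Artist C,Song Three,38,0.60,0.32,240250
"""


def load_tracks(source):
    """Read track data from a file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


df = load_tracks(io.StringIO(SAMPLE_CSV))
print(df.dtypes)   # column types inferred by pandas
print(df.head())   # first rows of the table
```

Passing a file-like object keeps the example self-contained; `load_tracks("some_file.csv")` would work the same way for a file on disk.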

#### 3.3.3 Data Analysis Capabilities:
The CSV format is well-suited for data analysis tasks, as it allows for efficient handling of large datasets and supports various operations such as filtering, grouping, aggregating, and merging data. With the dataset loaded into a DataFrame, researchers can easily explore the music trends, calculate summary statistics, visualize patterns, and perform advanced analytical techniques.
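
The operations mentioned here can be illustrated on a tiny DataFrame. This is only a sketch: the rows are invented, and the popularity threshold of 50 is an arbitrary assumption chosen for the example.

```python
import pandas as pd

# Invented example rows; column names follow the fields listed for the dataset.
df = pd.DataFrame({
    "artist_name": ["Artist A", "Artist A", "Artist B", "Artist C"],
    "track_name": ["Song One", "Song Two", "Song Three", "Song Four"],
    "popularity": [72, 64, 55, 38],
    "danceability": [0.81, 0.70, 0.44, 0.60],
})

# Filtering: keep tracks at or above a (hypothetical) popularity threshold.
popular = df[df["popularity"] >= 50]

# Grouping and aggregating: track count and mean popularity per artist.
per_artist = (
    df.groupby("artist_name")
      .agg(n_tracks=("track_name", "count"), mean_popularity=("popularity", "mean"))
      .reset_index()
)

# Summary statistics for the numeric columns.
summary = df[["popularity", "danceability"]].describe()

print(popular)
print(per_artist)
print(summary)
```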

#### 3.3.4 Integration with Analytical Libraries:
The CSV format's compatibility with Python's analytical libraries, such as numpy, matplotlib, and seaborn, enhances the suitability of the data for analysis. These libraries provide a rich set of tools and functions for data manipulation, visualization, and statistical analysis. Researchers can leverage these libraries to gain deeper insights into music trends, uncover correlations, and visualize the findings.
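
One way these libraries might be combined to uncover correlations is sketched below, using pandas for the full correlation matrix and numpy for a single pair. The values are invented and deliberately perfectly linear so the coefficients are easy to check by eye; they are not results from the Spotify dataset.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only; they are not results from the dataset.
df = pd.DataFrame({
    "danceability": [0.2, 0.4, 0.6, 0.8],
    "energy": [0.1, 0.3, 0.5, 0.7],
    "acousticness": [0.9, 0.7, 0.5, 0.3],
    "popularity": [30, 45, 50, 70],
})

# Pairwise Pearson correlation matrix (pandas).
corr = df.corr(method="pearson")

# The same coefficient for a single pair, computed directly with numpy.
r_dance_energy = np.corrcoef(df["danceability"], df["energy"])[0, 1]

print(corr.round(2))
print(round(r_dance_energy, 2))
```

The resulting `corr` matrix can then be passed to a plotting function such as `seaborn.heatmap(corr, annot=True)` to visualize the relationships.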

### 3.4 Consideration of Alternative Datasets:

While the chosen dataset from the Spotify Web API is appropriate for the research topic, it is essential to consider alternative datasets and their potential strengths and weaknesses. Two alternative datasets for the research topic of music trends could be:

- **Billboard Charts Dataset:** Strengths: provides historical records of popular music across different genres and is widely recognized as a measure of mainstream success. Weaknesses: offers limited information about audio features and lacks real-time data.<sup>[2]</sup>

- **Music Streaming Service Dataset (e.g., Apple Music or Deezer):** Strengths: provides comprehensive user listening data, including play counts, skip rates, and user-generated playlists. Weaknesses: access to such datasets may be restricted, and the availability of specific audio features might vary.<sup>[3]</sup>

Comparing these alternative datasets to the chosen Spotify dataset, it becomes apparent that the Spotify dataset offers a unique advantage. It combines detailed audio features, popularity metrics, and a large user base, making it well-suited for exploring music trends comprehensively.

By selecting the Spotify dataset, this research project can leverage its rich features and extensive coverage to gain deep insights into music trends on the platform, enhancing the overall quality and relevance of the analysis.

### 3.5 Ethics of Data Usage:

The ethical considerations surrounding the use of data in this analysis have been carefully addressed. The following aspects have been taken into account:

#### 3.5.1 Data Source and Provenance:

- The dataset used in this analysis was obtained from the Spotify Web API, which is an official API provided by Spotify. The data has been acquired through legitimate means and complies with Spotify's terms of service.
- The dataset is considered proprietary as it is specific to Spotify's platform and their user data. Proper licensing and usage agreements have been followed to ensure compliance with legal and ethical requirements.
- The provenance of the data, including its origin and acquisition techniques, has been clearly described, ensuring transparency and accountability in data usage.

#### 3.5.2 Data Usage and Intellectual Property:

- The analysis conducted on the dataset does not aim to create new forms of intellectual property. It is purely an exploratory data analysis aimed at understanding music trends on Spotify.
- Attribution to the data source (Spotify) is given by acknowledging the use of the Spotify Web API and providing appropriate citations in the project documentation.

#### 3.5.3 Implications and Potential Harm:

- Consideration has been given to the potential implications of using the data for this analysis. Steps have been taken to ensure that the analysis cannot be used to discriminate or to support dangerous or harmful assumptions.
- Any conclusions or findings drawn from the analysis are based on the statistical analysis of aggregated data, and care has been taken to avoid making assumptions that could perpetuate harmful stereotypes or biases.

#### 3.5.4 Data Processing Pipeline and Anonymization:

- The data used in the analysis is stored and accessed within a Jupyter Notebook, ensuring that it remains securely contained and accessible only to authorized individuals.
- Personally identifiable information has been removed or anonymized to protect the privacy and confidentiality of individuals.
- The data processing steps performed on the dataset are clearly documented, providing transparency in the analysis pipeline and ensuring that no personally identifiable distinctions can be made.

#### 3.5.5 Consideration of Dataset Biases:

Potential biases in the dataset, such as demographic imbalances, have been considered. The analysis takes into account the limitations and potential biases inherent in the dataset to ensure that the findings are interpreted and generalized appropriately.

## 4 Project Background

The project's background has been developed based on relevant literature and the need to explore music trends using data analysis techniques. Here's a summary of the project background:

### 4.1 Interest and Relevance
The field of music trends analysis holds significant interest and relevance due to the widespread availability of digital music platforms, such as Spotify, and their impact on music consumption. Understanding the preferences, patterns, and factors influencing music trends can benefit music industry professionals, artists, and researchers, enabling them to make informed decisions and create impactful music experiences.

### 4.2 Unexplored Research Questions
Although music trends have been studied to some extent, the specific research questions posed in this project have not been previously explored comprehensively. The project aims to address gaps in the literature and provide fresh insights into the relationship between music genres, audio features, and popularity trends on Spotify.

### 4.3 Scope of Work
The project will focus on analyzing specific aspects of music trends, including the identification of what makes a song popular, exploration of audio features, temporal analysis of popularity trends, and investigation of correlations between track characteristics and popularity. While the scope encompasses these areas, the project will not delve into external factors such as social media impact or detailed user demographics.

### 4.4 Analytical Data Processing Pipeline
The analytical data processing pipeline for this project involves several stages:

1. Data Acquisition: Obtain the dataset from the Spotify Web API, ensuring compliance with API usage guidelines.
2. Data Cleaning: Clean the dataset by handling missing values, removing duplicates, and addressing any inconsistencies or errors.
3. Exploratory Data Analysis: Perform exploratory analysis, including descriptive statistics, visualizations, and preliminary insights into music trends.
4. Statistical Analysis: Conduct statistical analyses to identify patterns, correlations, and trends in music genres, audio features, and popularity.
5. Evaluation and Interpretation: Evaluate the findings based on the project's aims and objectives, interpreting the results in the context of music trends on Spotify.
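
The data cleaning stage of this pipeline could look roughly like the sketch below. The function name `clean_tracks`, the choice of key columns, and the sample rows are assumptions made for illustration; the 0 to 100 range check reflects the scale on which Spotify defines track popularity.

```python
import pandas as pd


def clean_tracks(df):
    """Return a cleaned copy: tidy text, drop incomplete rows, invalid scores, duplicates."""
    out = df.copy()
    # Inconsistencies: strip stray whitespace from text columns.
    for col in ("artist_name", "track_name"):
        out[col] = out[col].str.strip()
    # Missing values: drop rows lacking a key field.
    out = out.dropna(subset=["artist_name", "track_name", "popularity"])
    # Errors: popularity is defined on a 0-100 scale, so discard anything outside it.
    out = out[out["popularity"].between(0, 100)]
    # Duplicates: keep the first occurrence of each artist/track pair.
    out = out.drop_duplicates(subset=["artist_name", "track_name"])
    return out.reset_index(drop=True)


# Invented rows covering each kind of problem handled above.
raw = pd.DataFrame({
    "artist_name": ["Artist A", "Artist A ", "Artist B", None, "Artist C"],
    "track_name": ["Song One", "Song One", "Song Two", "Song Three", "Song Four"],
    "popularity": [72, 70, 55, 40, 140],
})
cleaned = clean_tracks(raw)
print(cleaned)
```

Working on a copy leaves the raw data untouched, which makes it easier to document what each cleaning step removed.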

### 4.5 Evaluation of Aims and Objectives:
The aims and objectives of the project will be evaluated based on the chosen approach, which involves rigorous data exploration, statistical analyses, and interpretation of findings. The evaluation will involve assessing the extent to which the research questions have been addressed, determining the significance and relevance of the insights gained, and evaluating the project's contribution to the understanding of music trends on Spotify.

## 5 Data Acquisition and Cleaning<sup>[1]</sup>

### Note on Code Presentation:

To ensure data security, sensitive credentials have been protected, and certain code blocks have been presented as pseudocode. The pseudocode represents the structure and logic of the code without revealing actual credentials. It is vital to keep these credentials confidential to prevent unauthorized access. References to the Spotipy library and Spotify Web API documentation validate adherence to recommended guidelines. The subsequent explanation will provide insights into the methodology used for data retrieval, processing, and analysis.

### 5.1 About the Spotipy Library:

From the [official Spotipy docs](https://spotipy.readthedocs.io/en/latest/): 
>"Spotipy is a lightweight Python library for the Spotify Web API. With Spotipy you get full access to all of the music data provided by the Spotify platform."


### 5.2 About using the Spotify Web API:

Spotify offers a number of [API endpoints](https://beta.developer.spotify.com/documentation/web-api/reference/) to access the Spotify data. In this notebook, I used the search endpoint to collect track IDs and the audio features endpoint to retrieve each track's audio features.

### 5.3 Setting up Spotipy Client
The code below is all that is needed to set up Spotipy for querying the API endpoints. More comprehensive instructions can be found in the [official documentation](https://spotipy.readthedocs.io/en/latest/#installation).

In [None]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Placeholders: supply your own credentials and keep them out of shared notebooks
cid = "YOUR_CLIENT_ID"
secret = "YOUR_CLIENT_SECRET"

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
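
One way to keep credentials out of the notebook entirely is to read them from environment variables. The sketch below is an assumption about how this could be arranged, not the setup used in this project; the helper names `load_credentials` and `make_client` are hypothetical, while `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the variable names Spotipy itself looks for.

```python
import os


def load_credentials(env=os.environ):
    """Return (client_id, client_secret) from the environment, or raise if missing."""
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Environment variable {missing} is not set") from None


def make_client(env=os.environ):
    """Build a Spotipy client using the Client Credentials flow."""
    # Imported here so the credential helper can be used without Spotipy installed.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    cid, secret = load_credentials(env)
    manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
    return spotipy.Spotify(client_credentials_manager=manager)


# Usage (requires real credentials exported in the shell beforehand):
# sp = make_client()
```

With this arrangement the notebook can be shared or published without exposing the client secret.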

### 5.4 Get the Track ID Data

The process of collecting data is split into two components: track IDs and audio features. Now, we will proceed with obtaining 10,000 track IDs from the Spotify API.

In [None]:
# timeit library to measure the time needed to run this code
import timeit
start = timeit.default_timer()

# create empty lists where the results are going to be stored
artist_name = []
track_name = []
popularity = []
track_id = []

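# Page through the search results 50 tracks at a time (50 is the maximum page size).
# The offset must advance by the page size; stepping it by 1 re-fetches overlapping pages.
# Assumption: the Search endpoint caps the offset at 1000, so one query yields at most ~1,000 tracks.
for offset in range(0, 1000, 50):
    track_results = sp.search(q='year:2023', type='track', limit=50, offset=offset)
    for t in track_results['tracks']['items']:
        artist_name.append(t['artists'][0]['name'])
        track_name.append(t['name'])
        track_id.append(t['id'])
        popularity.append(t['popularity'])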
      

stop = timeit.default_timer()
print ('Time to run this code (in seconds):', stop - start)
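
How `limit` and `offset` interact when paging through search results can be sketched offline. In the example below, `search_page` is a hypothetical callable standing in for the API call, and `FAKE_CATALOGUE` is invented data, so the logic can be run without credentials; the `max_offset` default of 1000 is an assumption about the Search endpoint's cap.

```python
def collect_tracks(search_page, page_size=50, max_offset=1000):
    """Collect track records by paging through search results.

    `search_page(limit, offset)` is a hypothetical callable that returns the list of
    track items for one page, e.g. a thin wrapper around `sp.search(...)`.
    """
    rows = []
    # Each request starts where the previous page ended, so pages never overlap.
    for offset in range(0, max_offset, page_size):
        items = search_page(limit=page_size, offset=offset)
        if not items:  # an empty page means there are no more results
            break
        for t in items:
            rows.append({
                "artist_name": t["artists"][0]["name"],
                "track_name": t["name"],
                "track_id": t["id"],
                "popularity": t["popularity"],
            })
    return rows


# Stand-in for the API so the sketch runs offline: 120 fake tracks.
FAKE_CATALOGUE = [
    {"artists": [{"name": f"Artist {n}"}], "name": f"Song {n}", "id": f"id{n}", "popularity": n % 101}
    for n in range(120)
]


def fake_search_page(limit, offset):
    """Return one page of the fake catalogue, mimicking limit/offset semantics."""
    return FAKE_CATALOGUE[offset:offset + limit]


rows = collect_tracks(fake_search_page)
print(len(rows), rows[0]["track_id"], rows[-1]["track_id"])
```

With a real client, the callable could be `lambda limit, offset: sp.search(q='year:2023', type='track', limit=limit, offset=offset)['tracks']['items']`.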

### 5.5 Prepare Track ID Data for Analysis 

In the upcoming cells, we will conduct exploratory data analysis and prepare the recently acquired data.

Firstly, let's check the number of tracks in the data we obtained.

In [None]:
print('number of tracks in the track_id list:', len(track_id))

Now, let's load the lists into a dataframe:

In [None]:
import pandas as pd

df_tracks = pd.DataFrame({'artist_name':artist_name,'track_name':track_name,'track_id':track_id,'popularity':popularity})
print(df_tracks.shape)
df_tracks.head()

In [None]:
df_tracks.info()

Occasionally, the same track can have multiple track IDs, such as when it appears as a single or as part of an album.

We need to examine this situation and rectify it if necessary.

In [None]:
df_tracks[df_tracks.duplicated(subset=['artist_name','track_name'],keep=False)].count()

In the dataset, there are duplicate entries that will be removed in the following step:

In [None]:
df_tracks.drop_duplicates(subset=['artist_name','track_name'], inplace=True)

Let's recheck the dataset to make sure that all the duplicates have been removed.

In [None]:
df_tracks[df_tracks.duplicated(subset=['artist_name','track_name'],keep=False)].count()

Let's assess the number of remaining tracks now:

In [None]:
df_tracks.shape
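
By default, `drop_duplicates` keeps the first occurrence of each artist/track pair, which is whichever version the API happened to return first. A possible variation, shown here purely as a sketch on invented rows and not as the approach taken in this notebook, is to sort by popularity first so that the most popular version of each track is the one retained.

```python
import pandas as pd

# Invented rows: the same song appears as a single and on an album.
df_example = pd.DataFrame({
    "artist_name": ["Artist A", "Artist A", "Artist B"],
    "track_name": ["Song One", "Song One", "Song Two"],
    "track_id": ["id_single", "id_album", "id_other"],
    "popularity": [40, 75, 55],
})

# Sort so the most popular version comes first, keep that first occurrence,
# then restore the original row order.
deduped = (
    df_example.sort_values("popularity", ascending=False)
              .drop_duplicates(subset=["artist_name", "track_name"], keep="first")
              .sort_index()
)
print(deduped)
```

Which version to keep matters for the later analysis, because the popularity score belongs to a specific track ID rather than to the song as a whole.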