# Exploratory Data Analysis of Music Trends on Spotify

## 1 Aim
This independent research project conducts an exploratory data analysis of music trends using a dataset obtained from Spotify. It seeks insight into the popularity, characteristics, and patterns of music on the platform, and aims to identify factors that may influence those trends.

## 2 Objectives
1. **Identify Key Audio Features**
    Analyze the dataset to identify the key audio features such as danceability, energy, loudness, and valence that play a significant role in defining the characteristics of tracks.
    
2. **Explore Popularity Trends**
    Investigate the popularity trends within the dataset to understand how popularity varies across different tracks. Analyze the distribution of track popularity and identify any underlying patterns or relationships with other variables.
    
3. **Examine Relationships Between Variables**
    Explore the relationships between different variables such as danceability, energy, and acousticness to identify any correlations or associations that may exist. Investigate how these variables interact with each other and their potential impact on track popularity.
    
4. **Analyze Track Durations**
    Examine the durations of tracks and identify any trends or patterns. Investigate whether track duration has any relationship with popularity or other audio features.
    
5. **Visualize Insights**
    Utilize data visualization techniques to effectively communicate the findings and insights derived from the analysis. Generate visual representations such as histograms, scatter plots, and bar charts to illustrate the relationships and trends discovered in the dataset.
    
6. **Provide Interpretation and Recommendations**
    Interpret the results and findings from the data analysis, drawing meaningful conclusions about the music trends on Spotify. Based on these insights, provide recommendations or suggestions for artists, music industry professionals, or researchers in understanding and leveraging the observed trends.

By achieving these objectives through an exploratory data analysis of the provided Spotify dataset, this research project aims to enhance our understanding of music trends and provide valuable insights into the dynamics of music consumption on the platform.

## 3 Data Relevance and Justification
The dataset obtained from Spotify matches the project brief, which calls for an exploratory data analysis of music trends. The subsections below justify the choice of data source.

### 3.1 Origin and Acquisition
The dataset used in this research project was obtained from the Spotify Web API, an official API provided by Spotify. The Spotify Web API grants developers access to a vast array of music-related data, including information about tracks, albums, artists, and audio features.

To obtain the dataset, the Spotify Web API was queried from Python using Spotipy, a client library written for that API. All requests were authorized through this library, in line with Spotify's terms of service and API usage guidelines.

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Placeholders: substitute the client ID and secret from the Spotify developer dashboard
cid = "YOUR_CLIENT_ID"
secret = "YOUR_CLIENT_SECRET"

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
```

The API requests were crafted to retrieve the necessary data fields, including artist_name, track_name, popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms, and time_signature. These attributes were selected as they are directly relevant to the research question and objectives, allowing for a comprehensive analysis of music trends on Spotify.

Because the data comes directly from the official Spotify Web API, it reflects Spotify's own catalogue metadata and audio feature estimates, which makes it an authoritative source for this analysis. Access followed Spotify's terms of service and data access policies.

### 3.2 Appropriateness for Research Question
The dataset obtained from the Spotify Web API suits the research question because it provides the information needed to analyze music trends on the platform, as outlined below.

#### 3.2.1 Comprehensive Music Information:
The dataset includes artist and track information, allowing for the identification and analysis of specific songs and musicians. This comprehensive music data provides a foundation for exploring music trends and their popularity on Spotify.

#### 3.2.2 Popularity Metrics:
The dataset includes Spotify's popularity score, a value between 0 and 100 that quantifies the relative popularity of a track within the Spotify ecosystem. According to Spotify's documentation, the score is computed algorithmically and is based mostly on the total number of plays a track has had and how recent those plays are. It therefore lets researchers compare the prominence of different tracks.

#### 3.2.3 Audio Features:
The dataset includes essential audio features such as danceability, energy, and tempo. These features provide insights into the characteristics of tracks and their potential influence on popularity. Analyzing these audio features can reveal patterns and correlations between specific musical attributes and the success of tracks.

#### 3.2.4 Detailed Music Characteristics:
The dataset offers information on further musical characteristics, such as key, loudness, and mode. These details provide a deeper understanding of the musical elements present in the tracks and allow researchers to explore how these characteristics relate to popularity.

By encompassing comprehensive music information, popularity metrics, audio features, and detailed music characteristics, the dataset obtained from the Spotify Web API provides a solid foundation for conducting an exploratory data analysis of music trends on the platform.

### 3.3 Format and Suitability:
The Spotify Web API returns its results as JSON. For this project, the retrieved records were collected into a pandas DataFrame and exported to a CSV (Comma-Separated Values) file. CSV is widely recognized and compatible with most data analysis tools and libraries, which makes it a convenient format for exploring music trends on Spotify. The subsections below explain this in more detail.

#### 3.3.1 CSV Format:
The dataset is structured in a CSV format, where each row represents a specific track and each column contains the corresponding data attributes. CSV files are plain text files that use commas to separate values, making them highly readable and accessible. This format allows for easy sharing, manipulation, and analysis of the dataset.

#### 3.3.2 Loading into a DataFrame:
Python libraries such as pandas provide functionality to load CSV files into a DataFrame, a tabular data structure that offers flexibility and powerful data manipulation capabilities. By using the pandas library, the CSV data can be imported, organized, and processed in a structured manner. This enables researchers to perform a wide range of numerical and statistical analyses on the dataset.
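
As a minimal sketch of this loading step, the example below reads a small in-memory CSV with `pd.read_csv`. The two rows are made-up placeholder values, not records from the project dataset; only the column names follow the fields listed in Section 3.1.

```python
import io

import pandas as pd

# Two made-up rows that use a subset of the project's column names.
csv_text = """artist_name,track_name,popularity,danceability
Artist A,Song One,80,0.71
Artist B,Song Two,65,0.52
"""

# pd.read_csv accepts a file path or any file-like object.
df_example = pd.read_csv(io.StringIO(csv_text))

print(df_example.dtypes)  # column types inferred by pandas
print(df_example.shape)   # (rows, columns)
```

Passing a path such as `'SpotifyAudioFeatures.csv'` instead of the `StringIO` object would load the file from disk in the same way.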

#### 3.3.3 Data Analysis Capabilities:
Once the CSV file is loaded into a DataFrame, pandas supports operations such as filtering, grouping, aggregating, and merging data. Researchers can then explore the music trends, calculate summary statistics, visualize patterns, and apply more advanced analytical techniques.
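
The following sketch illustrates filtering, grouping, and aggregating on a tiny made-up frame. The values and the popularity threshold of 60 are illustrative assumptions, not figures from the analysis.

```python
import pandas as pd

# Made-up values; only the column names match the project dataset.
tracks = pd.DataFrame({
    "artist_name": ["Artist A", "Artist A", "Artist B", "Artist C"],
    "popularity": [80, 60, 90, 40],
    "energy": [0.7, 0.5, 0.9, 0.3],
})

# Filtering: keep tracks with a popularity of at least 60 (arbitrary threshold).
popular = tracks[tracks["popularity"] >= 60]

# Grouping and aggregating: mean popularity and energy per artist.
per_artist = popular.groupby("artist_name")[["popularity", "energy"]].mean()
print(per_artist)
```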

#### 3.3.4 Integration with Analytical Libraries:
The CSV format's compatibility with Python's analytical libraries, such as numpy, matplotlib, and seaborn, enhances the suitability of the data for analysis. These libraries provide a rich set of tools and functions for data manipulation, visualization, and statistical analysis. Researchers can leverage these libraries to gain deeper insights into music trends, uncover correlations, and visualize the findings.

### 3.4 Consideration of Alternative Datasets:

While the chosen dataset from the Spotify Web API is appropriate for the research topic, it is essential to consider alternative datasets and their potential strengths and weaknesses. Two alternative datasets for the research topic of music trends could be:

- **Billboard Charts Dataset:** Strengths - Provides historical records of popular music across different genres, widely recognized as a measure of mainstream success. Weaknesses - Limited information about audio features and lacks real-time data.<sup>[2]</sup>

- **Music Streaming Service Dataset (e.g., Apple Music or Deezer):** Strengths - Provides comprehensive user listening data, including play counts, skip rates, and user-generated playlists. Weaknesses - Access to such datasets may be restricted, and the availability of specific audio features might vary.<sup>[3]</sup>

Comparing these alternative datasets to the chosen Spotify dataset, it becomes apparent that the Spotify dataset offers a unique advantage. It combines detailed audio features, popularity metrics, and a large user base, making it well-suited for exploring music trends comprehensively.

By selecting the Spotify dataset, this research project can leverage its rich features and extensive coverage to gain deep insights into music trends on the platform, enhancing the overall quality and relevance of the analysis.

### 3.5 Ethics of Data Usage:

The ethical considerations surrounding the use of data in this analysis have been carefully addressed. The following aspects have been taken into account:

#### 3.5.1 Data Source and Provenance:

- The dataset used in this analysis was obtained from the Spotify Web API, which is an official API provided by Spotify. The data has been acquired through legitimate means and complies with Spotify's terms of service.
- The dataset is considered proprietary, as it consists of catalogue metadata and audio features served by Spotify's platform. The usage terms that govern the Web API have been followed to ensure compliance with legal and ethical requirements.
- The provenance of the data, including its origin and acquisition techniques, has been clearly described, ensuring transparency and accountability in data usage.

#### 3.5.2 Data Usage and Intellectual Property:

- The analysis conducted on the dataset does not aim to create new forms of intellectual property. It is purely an exploratory data analysis aimed at understanding music trends on Spotify.
- Attribution to the data source (Spotify) is given by acknowledging the use of the Spotify Web API and providing appropriate citations in the project documentation.

#### 3.5.3 Implications and Potential Harm:

- Consideration has been given to the potential implications of utilizing the data for this analysis. Steps have been taken to ensure that the analysis does not have the power to discriminate or produce dangerous or harmful assumptions.
- Any conclusions or findings drawn from the analysis are based on the statistical analysis of aggregated data, and care has been taken to avoid making assumptions that could perpetuate harmful stereotypes or biases.

#### 3.5.4 Data Processing Pipeline and Anonymization:

- The data used in the analysis is stored and accessed within a Jupyter Notebook, ensuring that it remains securely contained and accessible only to authorized individuals.
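- The retrieved fields contain no listener-level data. The only named entities are artists and track titles, which are public catalogue metadata, so the dataset holds no personally identifiable information about individual users.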
- The data processing steps performed on the dataset are clearly documented, providing transparency in the analysis pipeline and ensuring that no personally identifiable distinctions can be made.

#### 3.5.5 Consideration of Dataset Biases:

Potential biases in the dataset, such as demographic imbalances, have been considered. The analysis takes into account the limitations and potential biases inherent in the dataset to ensure that the findings are interpreted and generalized appropriately.

## 4 Project Background

The project's background has been developed based on relevant literature and the need to explore music trends using data analysis techniques. Here's a summary of the project background:

### 4.1 Interest and Relevance
The field of music trends analysis holds significant interest and relevance due to the widespread availability of digital music platforms, such as Spotify, and their impact on music consumption. Understanding the preferences, patterns, and factors influencing music trends can benefit music industry professionals, artists, and researchers, enabling them to make informed decisions and create impactful music experiences.

### 4.2 Unexplored Research Questions
Although music trends have been studied to some extent, the specific research questions posed in this project have not been previously explored comprehensively. The project aims to address gaps in the literature and provide fresh insights into the relationship between music genres, audio features, and popularity trends on Spotify.

### 4.3 Scope of Work
The project will focus on analyzing specific aspects of music trends, including the identification of what makes a song popular, exploration of audio features, temporal analysis of popularity trends, and investigation of correlations between track characteristics and popularity. While the scope encompasses these areas, the project will not delve into external factors such as social media impact or detailed user demographics.

### 4.4 Analytical Data Processing Pipeline
The analytical data processing pipeline for this project involves several stages:

1. Data Acquisition: Obtain the dataset from the Spotify Web API, ensuring compliance with API usage guidelines.
2. Data Cleaning: Clean the dataset by handling missing values, removing duplicates, and addressing any inconsistencies or errors.
3. Exploratory Data Analysis: Perform exploratory analysis, including descriptive statistics, visualizations, and preliminary insights into music trends.
4. Statistical Analysis: Conduct statistical analyses to identify patterns, correlations, and trends in music genres, audio features, and popularity.
5. Evaluation and Interpretation: Evaluate the findings based on the project's aims and objectives, interpreting the results in the context of music trends on Spotify.
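
The data cleaning stage can be sketched as a small function. This is an illustration under stated assumptions rather than the exact code used later in the notebook: `clean_tracks` is a hypothetical helper name, and the sample rows are made up.

```python
import pandas as pd


def clean_tracks(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values, then drop repeated artist/track pairs."""
    cleaned = df.dropna()
    cleaned = cleaned.drop_duplicates(subset=["artist_name", "track_name"])
    return cleaned.reset_index(drop=True)


# Made-up input: one exact repeat and one row with a missing artist.
raw = pd.DataFrame({
    "artist_name": ["Artist A", "Artist A", "Artist B", None],
    "track_name": ["Song One", "Song One", "Song Two", "Song Three"],
    "popularity": [80, 80, 65, 50],
})

cleaned = clean_tracks(raw)
print(cleaned)
```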

### 4.5 Evaluation of Aims and Objectives:
The aims and objectives of the project will be evaluated based on the chosen approach, which involves rigorous data exploration, statistical analyses, and interpretation of findings. The evaluation will involve assessing the extent to which the research questions have been addressed, determining the significance and relevance of the insights gained, and evaluating the project's contribution to the understanding of music trends on Spotify.

## 5 Data Acquisition and Cleaning<sup>[1]</sup>

### Note on Code Presentation:

To ensure data security, sensitive credentials have been protected, and certain code blocks have been presented as pseudocode. The pseudocode represents the structure and logic of the code without revealing actual credentials. It is vital to keep these credentials confidential to prevent unauthorized access. References to the Spotipy library and Spotify Web API documentation validate adherence to recommended guidelines. The subsequent explanation will provide insights into the methodology used for data retrieval, processing, and analysis.

### 5.1 About the Spotipy Library:

From the [official Spotipy docs](https://spotipy.readthedocs.io/en/latest/): 
>"Spotipy is a lightweight Python library for the Spotify Web API. With Spotipy you get full access to all of the music data provided by the Spotify platform."


### 5.2 About using the Spotify Web API:

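Spotify offers a number of [API endpoints](https://beta.developer.spotify.com/documentation/web-api/reference/) to access its data. This notebook uses two of them: the search endpoint, to collect track IDs, and the audio features endpoint, to retrieve the audio features for those tracks.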

### 5.3 Setting up Spotipy Client
The code below is all that is needed to set up Spotipy for querying the API. More comprehensive instructions can be found in the [official documentation](https://spotipy.readthedocs.io/en/latest/#installation).

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Credentials are shown as placeholders; use the values from your own
# Spotify developer dashboard and keep them out of shared notebooks
cid = "YOUR_CLIENT_ID"
secret = "YOUR_CLIENT_SECRET"

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
```
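
One way to keep credentials out of the notebook is to read them from environment variables. The sketch below uses only the standard library; `load_spotify_credentials` is a hypothetical helper, and the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the ones Spotipy itself looks for when no credentials are passed explicitly.

```python
import os


def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables."""
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Demonstration with a plain dict standing in for the real environment.
demo = load_spotify_credentials({
    "SPOTIPY_CLIENT_ID": "id-placeholder",
    "SPOTIPY_CLIENT_SECRET": "secret-placeholder",
})
print(demo)
```

In the notebook, the returned pair could be passed to `SpotifyClientCredentials(client_id=..., client_secret=...)` in place of the hard-coded strings.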

### 5.4 Get the Track ID Data

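The process of collecting data is split into two components: track IDs and audio features. We start with the track IDs, issuing 900 search requests for tracks released in 2023. Each request returns up to 50 results, which yields 45,000 rows in total, many of them repeats that are removed in Section 5.5.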

```python
# timeit is used to measure how long the collection takes
import timeit
start = timeit.default_timer()

# lists that will hold one entry per returned track
artist_name = []
track_name = []
popularity = []
track_id = []

# 900 requests of up to 50 tracks each; the offset advances by 1 per request,
# so consecutive result pages overlap and most tracks are returned repeatedly
for offset in range(900):
    track_results = sp.search(q='year:2023', type='track', limit=50, offset=offset)
    for t in track_results['tracks']['items']:
        artist_name.append(t['artists'][0]['name'])
        track_name.append(t['name'])
        track_id.append(t['id'])
        popularity.append(t['popularity'])

stop = timeit.default_timer()
print('Time to run this code (in seconds):', stop - start)
```

```
Time to run this code (in seconds): 397.7484246669992
```
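
Because the loop above moves the offset by one result per request, successive pages overlap heavily. A possible alternative is to advance the offset by the page size so that each request returns new results. The sketch below shows the idea with a stand-in for the API: `collect_tracks`, `search_page`, and `fake_search` are hypothetical names, and in practice `search_page` would be a thin wrapper around `sp.search`. Spotify's documentation also places an upper bound on the search offset, so the current limit should be checked before choosing `total`.

```python
def collect_tracks(search_page, total, page_size=50):
    """Collect up to `total` items by requesting non-overlapping pages.

    `search_page(offset, limit)` is a hypothetical callable that returns a
    list of track dicts for the given window.
    """
    tracks = []
    for offset in range(0, total, page_size):
        limit = min(page_size, total - offset)
        page = search_page(offset, limit)
        if not page:  # no more results available
            break
        tracks.extend(page)
    return tracks


# Stand-in for the API: a catalogue of 120 numbered tracks.
catalogue = [{"id": f"track-{n}"} for n in range(120)]


def fake_search(offset, limit):
    return catalogue[offset:offset + limit]


result = collect_tracks(fake_search, total=200)
unique_ids = {t["id"] for t in result}
print(len(result), len(unique_ids))
```

With this pattern every collected row is distinct at the request level, so the deduplication step only has to handle tracks that genuinely appear under several IDs.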


### 5.5 Prepare Track ID Data for Analysis 

In the upcoming cells, we will conduct exploratory data analysis and prepare the recently acquired data.

First, let's check the number of tracks in the data we obtained.

```python
print('number of tracks in the track_id list:', len(track_id))
```

```
number of tracks in the track_id list: 45000
```


Now, let's load the lists into a dataframe:

```python
import pandas as pd

df_tracks = pd.DataFrame({'artist_name':artist_name,'track_name':track_name,'track_id':track_id,'popularity':popularity})
print(df_tracks.shape)
df_tracks.head()
```

```
(45000, 4)
```

| | artist_name | track_name | track_id | popularity |
|---|---|---|---|---|
| 0 | Olivia Rodrigo | vampire | 3k79jB4aGmMDUQzEwa46Rz | 98 |
| 1 | Luke Combs | Fast Car | 1Lo0QY9cvc8sUB2vnIOxDT | 92 |
| 2 | Gunna | fukumean | 4rXLjWdF2ZZpXCVTfWcshS | 94 |
| 3 | Eslabon Armado | Ella Baila Sola | 3dnP0JxCgygwQH9Gm7q7nb | 97 |
| 4 | Grupo Frontera | un x100to | 6pD0ufEQq0xdHSsRbg9LBK | 99 |


```python
df_tracks.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45000 entries, 0 to 44999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   artist_name  45000 non-null  object
 1   track_name   45000 non-null  object
 2   track_id     45000 non-null  object
 3   popularity   45000 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 1.4+ MB
```


Occasionally, the same track can have multiple track IDs, such as when it appears as a single or as part of an album.

We need to examine this situation and rectify it if necessary.

```python
df_tracks[df_tracks.duplicated(subset=['artist_name','track_name'],keep=False)].count()
```

```
artist_name    44999
track_name     44999
track_id       44999
popularity     44999
dtype: int64
```

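Nearly all rows (44,999 of 45,000) share an artist and track name with at least one other row. This is expected, because the search offsets used above advance by one result at a time and the result pages therefore overlap. The following step removes the duplicates, keeping the first occurrence of each track:

```python
df_tracks.drop_duplicates(subset=['artist_name','track_name'], inplace=True)
```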

Let's recheck the dataset to make sure that all the duplicates have been removed.

```python
df_tracks[df_tracks.duplicated(subset=['artist_name','track_name'],keep=False)].count()
```

```
artist_name    0
track_name     0
track_id       0
popularity     0
dtype: int64
```

Let's assess the number of remaining tracks now:

```python
df_tracks.shape
```

```
(844, 4)
```
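
The two pandas calls used in this section count duplicates differently, which explains why 44,999 rows were flagged while 844 remain. The toy example below, built from made-up rows, shows the difference between `duplicated(keep=False)` and `drop_duplicates`.

```python
import pandas as pd

# Made-up rows: one track repeated three times and one unique track.
toy = pd.DataFrame({
    "artist_name": ["Artist A", "Artist A", "Artist A", "Artist B"],
    "track_name": ["Song One", "Song One", "Song One", "Song Two"],
})
key = ["artist_name", "track_name"]

# keep=False flags every member of a duplicated group, including the first.
flagged = int(toy.duplicated(subset=key, keep=False).sum())

# drop_duplicates (default keep='first') retains one row per group.
unique_rows = toy.drop_duplicates(subset=key)

print(flagged, len(unique_rows))
```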

### 5.6 Getting Audio Features Data
Now, let's utilize the audio features endpoint to retrieve the audio features data for track IDs.

It's important to note that this endpoint has a limitation of accepting a maximum of 100 track IDs per query.

To work within this limit, the code uses two nested loops. The outer loop slices the track IDs into batches of 100 and queries the endpoint once per batch, while the inner loop walks through the returned results and appends each one to the `rows` list.

Additionally, we need to include a check to handle cases where a track ID didn't return any audio features (i.e., None was returned), as this would cause issues during the process.

```python
# again measuring the time
start = timeit.default_timer()

# empty list, batch size, and a counter for None results
rows = []
batchsize = 100
None_counter = 0

for batch_start in range(0, len(df_tracks['track_id']), batchsize):
    # slice by position; the endpoint accepts at most 100 IDs per request
    batch = df_tracks['track_id'].iloc[batch_start:batch_start + batchsize].tolist()
    feature_results = sp.audio_features(batch)
    for t in feature_results:
        if t is None:
            None_counter += 1
        else:
            rows.append(t)

print('Number of tracks where no audio features were available:', None_counter)

stop = timeit.default_timer()
print('Time to run this code (in seconds):', stop - start)
```

```
Number of tracks where no audio features were available: 0
Time to run this code (in seconds): 1.5532145830002264
```
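
The batching pattern can be separated from the API call, which makes it easy to check without network access. In this sketch `fetch_in_batches` and `fake_audio_features` are hypothetical names; the fake function stands in for `sp.audio_features` and deliberately returns `None` for one ID.

```python
def fetch_in_batches(ids, fetch_batch, batch_size=100):
    """Call `fetch_batch` on consecutive slices of `ids`, skipping None results."""
    rows, missing = [], 0
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        for item in fetch_batch(batch):
            if item is None:
                missing += 1
            else:
                rows.append(item)
    return rows, missing


# Stand-in for the audio features endpoint: no features for "id-7".
def fake_audio_features(batch):
    return [None if track_id == "id-7" else {"id": track_id} for track_id in batch]


ids = [f"id-{n}" for n in range(250)]
rows_demo, missing_demo = fetch_in_batches(ids, fake_audio_features)
print(len(rows_demo), missing_demo)
```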


### 5.7 Preparing Audio Features Data for Analysis 
As with the previous dataset, we will explore the audio features data and prepare it for analysis.

First, let's check the number of tracks in the data we obtained.

```python
print('number of elements in the audio features rows list:', len(rows))
```

```
number of elements in the audio features rows list: 844
```


Finally, let's load the audio features in a dataframe.

```python
# build a dataframe from the list of audio feature dictionaries
df_audio_features = pd.DataFrame(rows)
print("Shape of the dataset:", df_audio_features.shape)
df_audio_features.head()
```

```
Shape of the dataset: (844, 18)
```

| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | type | id | uri | track_href | analysis_url | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.511 | 0.532 | 5 | -5.745 | 1 | 0.056 | 0.169 | 0.0 | 0.311 | 0.322 | 137.827 | audio_features | 3k79jB4aGmMDUQzEwa46Rz | spotify:track:3k79jB4aGmMDUQzEwa46Rz | `https://api.spotify.com/v1/tracks/3k79jB4aGmMD...` | `https://api.spotify.com/v1/audio-analysis/3k79...` | 219724 | 4 |
| 1 | 0.712 | 0.603 | 8 | -5.52 | 1 | 0.0262 | 0.186 | 0.0 | 0.115 | 0.67 | 97.994 | audio_features | 1Lo0QY9cvc8sUB2vnIOxDT | spotify:track:1Lo0QY9cvc8sUB2vnIOxDT | `https://api.spotify.com/v1/tracks/1Lo0QY9cvc8s...` | `https://api.spotify.com/v1/audio-analysis/1Lo0...` | 265493 | 4 |
| 2 | 0.847 | 0.622 | 1 | -6.747 | 0 | 0.0903 | 0.119 | 0.0 | 0.285 | 0.22 | 130.001 | audio_features | 4rXLjWdF2ZZpXCVTfWcshS | spotify:track:4rXLjWdF2ZZpXCVTfWcshS | `https://api.spotify.com/v1/tracks/4rXLjWdF2ZZp...` | `https://api.spotify.com/v1/audio-analysis/4rXL...` | 125040 | 4 |
| 3 | 0.668 | 0.758 | 5 | -5.176 | 0 | 0.0332 | 0.483 | 1.9e-05 | 0.0837 | 0.834 | 147.989 | audio_features | 3dnP0JxCgygwQH9Gm7q7nb | spotify:track:3dnP0JxCgygwQH9Gm7q7nb | `https://api.spotify.com/v1/tracks/3dnP0JxCgygw...` | `https://api.spotify.com/v1/audio-analysis/3dnP...` | 165671 | 3 |
| 4 | 0.569 | 0.724 | 6 | -4.076 | 0 | 0.0474 | 0.228 | 0.0 | 0.27 | 0.562 | 83.118 | audio_features | 6pD0ufEQq0xdHSsRbg9LBK | spotify:track:6pD0ufEQq0xdHSsRbg9LBK | `https://api.spotify.com/v1/tracks/6pD0ufEQq0xd...` | `https://api.spotify.com/v1/audio-analysis/6pD0...` | 194563 | 4 |


```python
df_audio_features.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 844 entries, 0 to 843
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      844 non-null    float64
 1   energy            844 non-null    float64
 2   key               844 non-null    int64  
 3   loudness          844 non-null    float64
 4   mode              844 non-null    int64  
 5   speechiness       844 non-null    float64
 6   acousticness      844 non-null    float64
 7   instrumentalness  844 non-null    float64
 8   liveness          844 non-null    float64
 9   valence           844 non-null    float64
 10  tempo             844 non-null    float64
 11  type              844 non-null    object 
 12  id                844 non-null    object 
 13  uri               844 non-null    object 
 14  track_href        844 non-null    object 
 15  analysis_url      844 non-null    object 
 16  duration_ms       844 non-null    int64  
...
```

To streamline the analysis, we need to remove unnecessary columns from the dataset.

Furthermore, I will rename the `id` column to `track_id` so that it matches the column name in the first dataframe.

```python
columns_to_drop = ['analysis_url','track_href','type','uri']
df_audio_features.drop(columns_to_drop, axis=1,inplace=True)

df_audio_features.rename(columns={'id': 'track_id'}, inplace=True)

df_audio_features.shape
```

```
(844, 14)
```

### 5.8 Merging Track ID data and Audio Features
In order to obtain the detailed information about each track we need to merge both dataframes.

```python
# merge both dataframes
# how='inner' keeps only the track IDs present in both dataframes
df = pd.merge(df_tracks, df_audio_features, on='track_id', how='inner')
# report the shape of the merged dataframe
print("Shape of the dataset:", df.shape)
df.head(10)
```

```
Shape of the dataset: (844, 17)
```

| | artist_name | track_name | track_id | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Olivia Rodrigo | vampire | 3k79jB4aGmMDUQzEwa46Rz | 98 | 0.511 | 0.532 | 5 | -5.745 | 1 | 0.056 | 0.169 | 0.0 | 0.311 | 0.322 | 137.827 | 219724 | 4 |
| 1 | Luke Combs | Fast Car | 1Lo0QY9cvc8sUB2vnIOxDT | 92 | 0.712 | 0.603 | 8 | -5.52 | 1 | 0.0262 | 0.186 | 0.0 | 0.115 | 0.67 | 97.994 | 265493 | 4 |
| 2 | Gunna | fukumean | 4rXLjWdF2ZZpXCVTfWcshS | 94 | 0.847 | 0.622 | 1 | -6.747 | 0 | 0.0903 | 0.119 | 0.0 | 0.285 | 0.22 | 130.001 | 125040 | 4 |
| 3 | Eslabon Armado | Ella Baila Sola | 3dnP0JxCgygwQH9Gm7q7nb | 97 | 0.668 | 0.758 | 5 | -5.176 | 0 | 0.0332 | 0.483 | 1.9e-05 | 0.0837 | 0.834 | 147.989 | 165671 | 3 |
| 4 | Grupo Frontera | un x100to | 6pD0ufEQq0xdHSsRbg9LBK | 99 | 0.569 | 0.724 | 6 | -4.076 | 0 | 0.0474 | 0.228 | 0.0 | 0.27 | 0.562 | 83.118 | 194563 | 4 |
| 5 | Morgan Wallen | Last Night | 59uQI0PADDKeE6UZDTJEe8 | 91 | 0.517 | 0.675 | 6 | -5.382 | 1 | 0.0357 | 0.459 | 0.0 | 0.151 | 0.518 | 203.853 | 163855 | 4 |
| 6 | Yng Lvcas | La Bebe - Remix | 2UW7JaomAMuX9pZrjVpHAU | 99 | 0.812 | 0.479 | 2 | -5.678 | 0 | 0.333 | 0.213 | 1e-06 | 0.0756 | 0.559 | 169.922 | 234353 | 4 |
| 7 | Bad Bunny | WHERE SHE GOES | 7ro0hRteUMfnOioTFI5TG1 | 100 | 0.652 | 0.8 | 9 | -4.019 | 0 | 0.0614 | 0.143 | 0.629 | 0.112 | 0.234 | 143.978 | 231704 | 4 |
| 8 | PinkPantheress | Boy's a Liar Pt. 2 | 6AQbmUe0Qwf5PZnt4HmTXv | 94 | 0.696 | 0.809 | 5 | -8.254 | 1 | 0.05 | 0.252 | 0.000128 | 0.248 | 0.857 | 132.962 | 131013 | 4 |
| 9 | Taylor Swift | I Can See You (Taylor’s Version) (From The Vault) | 5kHMfzgLZP95O9NBy0ku4v | 92 | 0.694 | 0.764 | 6 | -4.893 | 1 | 0.0337 | 0.0586 | 0.0 | 0.0608 | 0.819 | 123.044 | 273186 | 4 |
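
When merging on an ID column, it can help to confirm that the key is unique on both sides and to see which rows would be lost by an inner join. The sketch below uses the `validate` and `indicator` parameters of `pd.merge` on two made-up frames; it is an optional check, not part of the original pipeline.

```python
import pandas as pd

# Made-up frames: "c" exists only on the left, "d" only on the right.
left = pd.DataFrame({"track_id": ["a", "b", "c"], "popularity": [90, 80, 70]})
right = pd.DataFrame({"track_id": ["a", "b", "d"], "energy": [0.6, 0.7, 0.8]})

# validate raises MergeError if track_id repeats on either side;
# indicator adds a _merge column recording where each row came from.
audit = pd.merge(left, right, on="track_id", how="outer",
                 validate="one_to_one", indicator=True)
print(audit["_merge"].value_counts())

# The inner join keeps only the IDs present in both frames.
merged = pd.merge(left, right, on="track_id", how="inner", validate="one_to_one")
print(merged.shape)
```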


```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 844 entries, 0 to 843
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artist_name       844 non-null    object 
 1   track_name        844 non-null    object 
 2   track_id          844 non-null    object 
 3   popularity        844 non-null    int64  
 4   danceability      844 non-null    float64
 5   energy            844 non-null    float64
 6   key               844 non-null    int64  
 7   loudness          844 non-null    float64
 8   mode              844 non-null    int64  
 9   speechiness       844 non-null    float64
 10  acousticness      844 non-null    float64
 11  instrumentalness  844 non-null    float64
 12  liveness          844 non-null    float64
 13  valence           844 non-null    float64
 14  tempo             844 non-null    float64
 15  duration_ms       844 non-null    int64  
 16  time_signature    844 non-null    int64  
...
```

Let's do the final check for duplicates.

```python
df[df.duplicated(subset=['artist_name','track_name'],keep=False)]
```

The result is an empty DataFrame containing only the column headers, which confirms that no duplicated artist and track name pairs remain.


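Our dataset is now ready to be analysed. We just need to export it as a CSV file, leaving out the row index so that it does not appear as an extra column when the file is read back:

```python
df.to_csv('SpotifyAudioFeatures.csv', index=False)
```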
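
A small round trip illustrates how the `index` argument of `to_csv` affects the exported file. The frame below is made up, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

frame = pd.DataFrame({"track_id": ["a", "b"], "popularity": [90, 80]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```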

## 6 Exploratory Data Analysis

**Import Required Libraries**

```python
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```

**Let's import and view first five rows of the dataset**

```python
# Reading dataset and checking first 5 rows
df_track = pd.read_csv('archive/SpotifyAudioFeatures.csv')
df_track.head()
```

| | artist_name | track_name | track_id | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Luke Combs | Fast Car | 1Lo0QY9cvc8sUB2vnIOxDT | 92 | 0.712 | 0.603 | 8 | -5.52 | 1 | 0.0262 | 0.186 | 0.0 | 0.115 | 0.67 | 97.994 | 265493 | 4 |
| 1 | Eslabon Armado | Ella Baila Sola | 3dnP0JxCgygwQH9Gm7q7nb | 98 | 0.668 | 0.758 | 5 | -5.176 | 0 | 0.0332 | 0.483 | 1.9e-05 | 0.0837 | 0.834 | 147.989 | 165671 | 3 |
| 2 | Grupo Frontera | un x100to | 6pD0ufEQq0xdHSsRbg9LBK | 99 | 0.569 | 0.724 | 6 | -4.076 | 0 | 0.0474 | 0.228 | 0.0 | 0.27 | 0.562 | 83.118 | 194563 | 4 |
| 3 | Morgan Wallen | Last Night | 59uQI0PADDKeE6UZDTJEe8 | 91 | 0.517 | 0.675 | 6 | -5.382 | 1 | 0.0357 | 0.459 | 0.0 | 0.151 | 0.518 | 203.853 | 163855 | 4 |
| 4 | Yng Lvcas | La Bebe - Remix | 2UW7JaomAMuX9pZrjVpHAU | 99 | 0.812 | 0.479 | 2 | -5.678 | 0 | 0.333 | 0.213 | 1e-06 | 0.0756 | 0.559 | 169.922 | 234353 | 4 |


**Check description of the data with describe() function present in Pandas**

```python
df_track.describe()
```

| | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 | 826.0 |
| mean | 75.116223 | 0.654697 | 0.656878 | 5.122276 | -6.304818 | 0.587167 | 0.104716 | 0.220984 | 0.019584 | 0.181999 | 0.497263 | 125.100018 | 193323.313559 | 3.895884 |
| std | 18.850379 | 0.143296 | 0.170749 | 3.56866 | 2.539417 | 0.492642 | 0.10455 | 0.23689 | 0.106318 | 0.126309 | 0.233015 | 28.397985 | 48835.809241 | 0.386195 |
| min | 0.0 | 0.18 | 0.102 | 0.0 | -19.288 | 0.0 | 0.0232 | 2.1e-05 | 0.0 | 0.0322 | 0.0385 | 56.829 | 60720.0 | 1.0 |
| 25% | 73.0 | 0.55125 | 0.55025 | 2.0 | -7.56925 | 0.0 | 0.0373 | 0.031875 | 0.0 | 0.101 | 0.315 | 102.048 | 162986.25 | 4.0 |
| 50% | 78.0 | 0.6625 | 0.675 | 5.0 | -5.9785 | 1.0 | 0.05495 | 0.13 | 0.0 | 0.13 | 0.4865 | 125.539 | 189257.0 | 4.0 |
| 75% | 83.0 | 0.768 | 0.789 | 8.0 | -4.60025 | 1.0 | 0.12975 | 0.333 | 6.4e-05 | 0.2235 | 0.672 | 145.47175 | 217130.0 | 4.0 |
| max | 100.0 | 0.971 | 0.991 | 11.0 | -0.484 | 1.0 | 0.798 | 0.985 | 0.945 | 0.948 | 0.972 | 208.138 | 500117.0 | 5.0 |


**Let's check the data types of various columns**

```python
df_track.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 826 entries, 0 to 825
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artist_name       826 non-null    object 
 1   track_name        826 non-null    object 
 2   track_id          826 non-null    object 
 3   popularity        826 non-null    int64  
 4   danceability      826 non-null    float64
 5   energy            826 non-null    float64
 6   key               826 non-null    int64  
 7   loudness          826 non-null    float64
 8   mode              826 non-null    int64  
 9   speechiness       826 non-null    float64
 10  acousticness      826 non-null    float64
 11  instrumentalness  826 non-null    float64
 12  liveness          826 non-null    float64
 13  valence           826 non-null    float64
 14  tempo             826 non-null    float64
 15  duration_ms       826 non-null    int64  
 16  time_signature    826 non-null    int64  
...
```

**Find null values present in the dataset**

```python
# Check null values in dataset
df_track.isnull().sum()
```

```
artist_name         0
track_name          0
track_id            0
popularity          0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
time_signature      0
dtype: int64
```

As there are no missing values, we move to the next step and perform analysis.

### 6.1 Identify Key Audio Features

#### 6.1.1 Descriptive statistics
We need to calculate descriptive statistics for the audio feature columns to gain insights into their distribution, central tendency, and variability.

```python
# Columns of interest; 'popularity' is listed last so that it can be
# excluded from the audio feature statistics with a slice
audio_features = ['danceability', 'energy', 'loudness', 'valence', 'tempo', 'popularity']

# Descriptive statistics for the audio features only (popularity excluded)
audio_stats = df_track[audio_features[:-1]].describe()
print(audio_stats)
```

```
       danceability      energy    loudness     valence       tempo
count    826.000000  826.000000  826.000000  826.000000  826.000000
mean       0.654697    0.656878   -6.304818    0.497263  125.100018
std        0.143296    0.170749    2.539417    0.233015   28.397985
min        0.180000    0.102000  -19.288000    0.038500   56.829000
25%        0.551250    0.550250   -7.569250    0.315000  102.048000
50%        0.662500    0.675000   -5.978500    0.486500  125.539000
75%        0.768000    0.789000   -4.600250    0.672000  145.471750
max        0.971000    0.991000   -0.484000    0.972000  208.138000
```
