# CSMODEL Case Study - Phase 1

### Members:
- Angelo Guerra
- Adrian Yung Cheng
- Mark Daniel Gutierrez
- Alina Sayo

**Section**: S15

**Instructor**: Mr. Gabriel Avelino Sampedro

In this notebook, we will be using the **[Spotify Top Hits from 2000-2019](https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019/data)** Dataset. The notebook will cover an analysis of the raw dataset and various processes to extract meaningful insights and conclusions from the data.


## Importing Libraries

First, import the necessary libraries to perform data operations throughout this notebook:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

plt.style.use("seaborn-v0_8-darkgrid")

* **NumPy** - NumPy is a software library for Python designed for working with arrays and encompassing functions related to linear algebra, Fourier transforms, and matrices.
* **Pandas** - Pandas is a software library for Python designed for data manipulation and data analysis.
* **Matplotlib** - Matplotlib is a software library for data visualization for Python, allowing us to easily render various types of graphs.
* **Seaborn** - Seaborn is a software library for data visualization for Python designed to create attractive and informative statistical graphics, making it easier to make complex visualizations compared to using Matplotlib alone.

## Dataset Description

### Brief Description

The dataset used throughout this notebook consists of a `.csv` file containing audio statistics of the top 2000 tracks from 2000-2019 on Spotify, a global audio streaming service. The data contains information about each track and its qualities, including the song artist, year it was released, popularity rate, and various characteristics. The dataset takes advantage of Spotify's huge collection of music to create a useful resource which could help those who want to statistically assess and analyze the platform's top hits from the past two decades.

### Collection Process

The dataset **[Top Hits Spotify 2000-2019](https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019/data)** was compiled from Spotify, a popular digital music streaming service, and was posted on Kaggle by user Mark Koverha. This dataset draws from Spotify’s extensive music catalog to create a comprehensive resource. Spotify has created multiple playlists for the top hits of each year, and Koverha extracted the data from these playlists using the `Spotipy` library for Python.

Currently, Spotify provides a publicly available API. The Spotify Web API and the `Spotipy` Python library were utilized to gather track information to compile the dataset.

### Dataset File Structure

Each entry (row) in the dataset is one of the top 2000 hit songs released from 2000 to 2019, while each column represents a single attribute of the track, such as its artist name or title. There are a total of **2000 entries** and **18 variables** in the dataset.

The `read_csv()` function of the pandas library, in this case, is used to assign the dataset to a properly-usable variable in Python. The `info()` function, on the other hand, is used to display the general information about the dataset itself.

Now, let us load the data from the `songs_normalize.csv` file using the `read_csv` function and assign the resulting dataframe to the variable `spotify_df`. Then, we can use the `info` function to display a quick summary of how the data is structured.

In [None]:
spotify_df = pd.read_csv('songs_normalize.csv')
spotify_df.info()

The `shape` attribute gives the dimensions of the dataframe (dataset) as a 2-tuple containing the number of entries (rows) and variables (columns).

In [None]:
spotify_df.shape

Based on the code execution above, there are indeed **2000** entries and **18** variables present in the dataset.

We can then use the `describe()` function to provide a statistical description of the dataset and its variables that contain numerical data, which includes the count, mean, standard deviation, quartiles, and minimum/maximum values.

In [None]:
spotify_df.describe()

By calling the `head()` function of the pandas library, the program will now display the first 5 (default value) rows/entries of the dataset. 

In [None]:
spotify_df.head()

To display the last 5 (default value) rows of the dataset, the `tail()` function will be used.

In [None]:
spotify_df.tail()

### Variables

Each variable in the dataset describes one attribute of a track. The following are the variables used, their description/representation, and their data type:

| Variable         | Description/Representation | Datatype |
|------------------|----------------------------|----------|
| Artist           | The track's singer.    | Object |
| Song             | The track's name. | Object  |
| Duration_ms      | The track's duration in milliseconds.  | Integer |
| Explicit         | Dictates whether or not the track contains explicit content.    | Boolean |
| Year             | The track's release year.             | Integer |
| Popularity       | A measure that quantifies the popularity of the song. Ranges from 0 (least popular) to 100 (most popular). | Integer |
| Danceability     | A measure of how suitable a track is for dancing, based on its beat strength, rhythm stability, and tempo. Ranges from 0.0 (least danceable) to 1.0 (most danceable). | Float |
| Energy           | A measure of the track's intensity and activity. Ranges from 0.0 (least energy) to 1.0 (most energy). | Float |
| Key              | The music key of the track, represented as integers using the standard Pitch Class notation. If no key is detected, the value is -1. | Integer |
| Loudness         | A measure of the overall loudness of the track in decibels (dB), averaged across the entire track. | Float |
| Mode             | Indicates whether the track is in a major (1) or minor (0) scale. | Integer |
| Speechiness      | Measures the presence of spoken words in the track, with values closer to 1.0 indicating more speech content and values closer to 0.0 most likely representing music and other non-speech-like tracks. | Float |
| Acousticness     | A confidence measure of whether the track is acoustic, with 1.0 representing high confidence that the track is acoustic. | Float |
| Instrumentalness | Predicts whether the track contains vocals, with values closer to 1.0 indicating a higher likelihood of no vocal content. | Float |
| Liveness         | Detects the presence of an audience in the recording, with higher values closer to 1.0 suggesting a live performance. | Float |
| Valence          | A measure describing the musical positiveness of the track, ranging from 0.0 (negative emotions) to 1.0 (positive emotions). Tracks with high valence sound happier, while tracks with low valence sound sadder. | Float |
| Tempo            | The estimated tempo (speed or pace) of the track in beats per minute (BPM). | Float |
| Genre            | The genre of the track.    | Object |

## Data Cleaning

Before exploring and analyzing the dataset, data cleaning and preprocessing techniques shall first be performed to address inconsistencies within the dataset that could result in any erroneous data analysis.

The researchers have specifically checked for the following aspects in the dataset:
- Removing unused variables
- Checking for multiple representations in each variable
- Incorrect datatype of a variable
- Default values of a variable
- Missing data
- Duplicate data
- Inconsistent formatting of values

### Removing Unused Variables

To remove a specific entry or column from the dataset, the `drop()` function is called.

The researchers decided to remove the variables deemed unnecessary or outside the scope of this study. These consisted of variables not concerned with the research question.

In [None]:
# Drop unnecessary variables
spotify_df = spotify_df.drop(['explicit', 'danceability', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'liveness', 'valence', 'genre'], axis = 1)
spotify_df.info()

The variables `explicit`, `danceability`, `key`, `loudness`, `mode`, `speechiness`, `acousticness`, `liveness`, `valence`, and `genre` were removed because they will not be used in this study. In this way, the dataset will only contain the variables necessary for the analysis.

### Check for Multiple Representations

To check whether or not the dataset contains multiple representations of values for variables with categorical data, the `value_counts()` function of the pandas library can be called to return the count of each unique value in the Series.

In [None]:
# Check if there are any misspelled or incorrect values in Artist
print(spotify_df['artist'].value_counts())

In [None]:
# Check if there are any misspelled or incorrect values in Song
print(spotify_df['song'].value_counts())

According to the results of the code above, all unique values within the `artist` and `song` categorical variables of the dataset fit their valid parameters and are correctly represented.
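A raw `value_counts()` listing is long for a column like `artist`, so spellings that differ only in letter case or surrounding whitespace are easy to overlook. The following is a minimal sketch of one way to surface them, assuming a small made-up Series (`artists`, not the real dataset) and a hypothetical helper named `find_variant_groups`:

```python
import pandas as pd

def find_variant_groups(values: pd.Series) -> dict:
    """Return normalized keys that map to more than one raw spelling (hypothetical helper)."""
    # normalize: strip surrounding whitespace and ignore letter case
    keys = values.str.strip().str.casefold()
    # collect the distinct raw spellings that share each normalized key
    groups = values.groupby(keys).unique()
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}

# made-up sample values, for illustration only
artists = pd.Series(['Rihanna', 'rihanna ', 'Drake', 'Eminem', 'EMINEM'])
variants = find_variant_groups(artists)
print(variants)
```

An empty dictionary would suggest that no value has multiple representations under this particular normalization; other kinds of variation (for example, abbreviations) would still need a manual look.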

### Check for Incorrect Datatypes

To check the data type of every variable/column in the dataset, the `dtypes` property of a dataframe in the pandas library can also be called aside from the `info()` function to assess whether the assigned datatype is correct for the corresponding variable. The code below will list down every variable and its corresponding data type.

In [None]:
spotify_df.dtypes

Based on the datatypes displayed from the code execution, all variables in the dataset have their appropriate corresponding datatype assigned to them based on their description.

### Check for Default Values

To check whether any default values were placed in a variable of the dataset, the `unique()` method can be used to list down all unique values per variable.

In [None]:
for col in spotify_df:
    print("'{}' column's unique values:\n".format(col), spotify_df[col].unique())
    print("")

According to the results provided by the function, only the `year` variable contained data that did not fit its parameters (years from 2000 to 2019 only). The rest of the variables' unique values fit their respective parameters based on their description.

Given this finding, the dataframe could be preprocessed and filtered to include entries only within the valid timeframe through querying.

In [None]:
# Select entries that have a release year date from 2000 to 2019 only, then assign back to the dataframe
spotify_df = spotify_df[(spotify_df['year'] >= 2000) & (spotify_df['year'] <= 2019)]
spotify_df.shape

After dropping the invalid records, calling the `shape` of the newly-updated dataset displayed only **1958 remaining entries**. This meant that out of the initial 2000 entries of the dataset, there were **42 invalid entries** removed.
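The count of removed entries follows directly from the row counts before and after filtering (2000 − 1958 = 42). As a minimal sketch of the same range filter on a made-up dataframe (`sample` is hypothetical, not the real dataset), `Series.between()` expresses the inclusive bounds in a single call:

```python
import pandas as pd

# made-up sample, for illustration only
sample = pd.DataFrame({'song': ['a', 'b', 'c', 'd'],
                       'year': [1999, 2000, 2019, 2020]})

# between() is inclusive on both ends by default
in_range = sample['year'].between(2000, 2019)
filtered = sample[in_range]

# number of rows dropped by the filter
removed = len(sample) - len(filtered)
print(f'kept {len(filtered)} rows, removed {removed}')
```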

### Check for Missing Data

To check for missing data within the dataset, the `isnull()` and `any()` functions can be used to assess whether any variable contains null values, represented as `NaN` or `null`. The given function call will list each variable with a boolean value indicating whether or not it has null values.

In [None]:
spotify_df.isnull().any()

Based on the results provided by the code execution, all the variables in the dataset do not contain any null values.
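If a column did contain nulls, `any()` would only report that they exist. A small sketch on a made-up dataframe (`sample` is hypothetical) shows how `isnull().sum()` reports how many values are missing per column instead:

```python
import numpy as np
import pandas as pd

# made-up sample with deliberately missing values, for illustration only
sample = pd.DataFrame({'song': ['a', None, 'c'],
                       'tempo': [120.0, np.nan, np.nan]})

# True counts as 1 when summed, so this yields the number of nulls per column
null_counts = sample.isnull().sum()
print(null_counts)
```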

### Check for Duplicate Data

To check for duplicated data within the dataset, the `duplicated()` and `any()` functions can be called to identify whether or not there are duplicate records. The code below returns the boolean value `True` if there are repeated records and `False` if there are none.

In [None]:
spotify_df.duplicated().any()

Based on the provided result of the function, we can conclude the dataset contains duplicate records. Given this finding, the `drop_duplicates()` function can be used to remove these records.

In [None]:
spotify_df = spotify_df.drop_duplicates()
spotify_df.shape

The new dataset, which no longer includes duplicate records, now contains **1899 unique records**. This means that there were **59 duplicate records** among the 1958 entries that remained after the year filter.
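Before dropping duplicates, it can help to count and inspect them. The sketch below uses a made-up dataframe (`sample` is hypothetical) to show the difference between counting repeated rows and listing every row involved in a duplication:

```python
import pandas as pd

# made-up sample, for illustration only
sample = pd.DataFrame({
    'artist': ['A', 'A', 'B', 'C', 'C'],
    'song':   ['x', 'x', 'y', 'z', 'z'],
    'year':   [2001, 2001, 2005, 2010, 2011],
})

# rows that exactly repeat an earlier row (the first occurrence is not counted)
n_duplicates = int(sample.duplicated().sum())

# every row that takes part in an exact duplication, including first occurrences
all_copies = sample[sample.duplicated(keep=False)]

deduplicated = sample.drop_duplicates()
print(n_duplicates, len(all_copies), len(deduplicated))
```

Note that the two `C`/`z` rows differ in `year`, so they are not treated as duplicates; only rows identical in every column are.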

### Checking for Inconsistent Formatting

To check if there exists any values in the dataset with inconsistent formatting, the `unique()` function can once again be called. The code below will list down all the unique values in a given variable.

In [None]:
for col in spotify_df:
    print("'{}' column's unique values:\n".format(col), spotify_df[col].unique())
    print("")

Upon checking the results, the dataset's variables do not seem to contain any values with inconsistent formatting, based on their corresponding descriptions.

## Exploratory Data Analysis

We perform Exploratory Data Analysis (EDA) to understand the dataset and organize it in a way that is easier to interpret. Generally, this involves creating data visualizations and numerical summaries to reveal the nature of the data and to form well-grounded assumptions that can, where needed, lead to a specific research question.

One of the key variables included in the dataset is the `year` variable, which indicates the year the top hit track was released. Given the dataset's emphasis on analyzing the evolution of the top hits' qualities over time, the variable and its relationships with the other variables represent a potential point of interest around which the EDA can revolve.

Among the 18 variables in the dataset, 14 are numerical in nature: `duration_ms`, `year`, `popularity`, `danceability`, `energy`, `key`, `loudness`, `mode`, `speechiness`, `acousticness`, `instrumentalness`, `liveness`, `valence`, and `tempo`. However, the EDA for this notebook will only be focusing on the variables `duration_ms`, `instrumentalness`, and `tempo` due to their consistent and applicable nature in relation to a song's release year. By analyzing the `year` variable in conjunction with these 3 numerical variables, the researchers can further narrow down the scope of the EDA as well as maintain some form of consistency in the data analysis strategies used.

With the aforementioned attributes in mind, the following EDA questions are investigated to gain insights into possible patterns associated with the track's release year:

1. What is the distribution of songs according to release year?

2. What is (a) the distribution of songs according to duration and (b) the average duration of top hits per year?

3. What is (a) the distribution of the songs according to their level of instrumentalness and (b) the average instrumentalness level per year?

4. What is (a) the distribution of the songs according to their level of tempo and (b) the average tempo level per year?

5. What is the relationship between the tracks' duration, instrumentalness, and tempo with the release year?

### Question 1: What is the distribution of top hits according to release year?

To answer this question, we obtain our numerical summaries and data visualizations based on the `year` and `song` variables. The `year` variable is used to group the data into the different release years available, while the `song` variable holds the top-hit songs released in a given year.

To begin, we first reassign the required data taken from `spotify_df` into a new dataframe `songs_per_year`. This would indicate the distribution and relationship of the release years available in the `spotify_df` dataframe and the number of songs released each year.

#### Numerical Summaries

Now, we calculate the number of songs in the dataset for each release year. We will be using the `agg` function and `groupby` function to perform numerical summaries regarding the distribution of songs per year in the dataset. The `groupby` function is used to group the data by the `year` variable, while the `agg` function is used to count the songs within each year. The use of these functions allows for efficient grouping and aggregation operations on the dataset.

In [None]:
songs_per_year = spotify_df.groupby('year').agg({'song':'count'})
songs_per_year

When observed, the new dataframe `songs_per_year` should contain both the release years present in the dataset and their corresponding number of top-hit songs. It is apparent from this data that the song count per year is not equal, which implies that the music industry may have experienced variations in the number of top hits released annually.

In particular, the year 2000 had only 71 top hits, marking it as a year with relatively fewer chart-toppers. In contrast, the year 2012 stands out with the highest number of top hits, with a total of 113 songs. This variance in the song count over the years showcases the fluctuation in music trends and popularity over the analyzed period.
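Rather than reading the extremes off the table by eye, the years with the most and fewest songs can be looked up programmatically. A minimal sketch on a made-up dataframe (`sample` is hypothetical, not the real dataset) using `idxmax()` and `idxmin()` on the per-year counts:

```python
import pandas as pd

# made-up sample, for illustration only
sample = pd.DataFrame({'song': list('abcdef'),
                       'year': [2000, 2001, 2001, 2002, 2002, 2002]})

# number of songs per release year
counts = sample.groupby('year')['song'].count()

# index labels (years) holding the largest and smallest counts
busiest_year = counts.idxmax()
quietest_year = counts.idxmin()
print(busiest_year, quietest_year)
```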

#### Data Visualization

*Bar Plot*

One of the methods we can utilize to visualize the distribution of `songs` per `year` is a **bar plot** to represent the number of top hits included per release year.

By utilizing seaborn's `barplot` function, we set the y-axis to the aggregated count of `song` and the x-axis to the song's release `year`. The data is sourced from the `songs_per_year` dataframe, and the desired figure size is set beforehand.

In [None]:
# bar plot of the number of songs per year
plt.figure(figsize=(12, 5))
ax = sb.barplot(x = 'year', y ='song', data = songs_per_year, color='pink')

# add labels
plt.title('Number of Top Hit Songs per Year')
plt.xlabel('Release Year')
plt.ylabel('Number of Top Hits')
plt.xticks(rotation = 50)

# label on top of each bar
for container in ax.containers:
    ax.bar_label(container, size=15)

plt.show()

Similar to the aforementioned numerical summaries, the bar plot displayed above illustrates the number of top hits in each year from 2000 to 2019. The data shows that 2012 had the highest number of top hits with 113 songs, making it the most abundant year for hit songs in the dataset. On the other hand, 2000 had the fewest top hits, with only 71 songs.

Analyzing the trends as the years go by, the early 2000s saw a moderate number of hits, with a steady increase from 2000 to 2005. The count then fluctuated from year to year, with a peak occurring in 2012, possibly reflecting a period of musical diversity and commercial success, followed by a steady decline in the subsequent years. This variation may reflect changing trends in music popularity, industry dynamics, or other external factors.

### Question 2: What is (a) the distribution of songs according to duration and (b) the average duration of top hits per year?

The second question centers around 2 related concepts: the distribution of songs according to duration, and the average duration of top hits per year. This question mainly revolves around the `duration_ms` variable and its relationship with the `year` variable.

To present the results for the `duration_ms` variable in a readable form, we define a conversion function that takes a song duration in milliseconds from the dataframe and converts it into a `minute:second` format.
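The arithmetic behind such a conversion is an integer division followed by a remainder. As a quick illustration, here is a compact sketch using `divmod`, written under the hypothetical name `format_duration` so that it does not clash with the function used in this notebook:

```python
def format_duration(ms):
    """Format a duration in milliseconds as 'm:ss' (fractions of a second are dropped)."""
    # whole seconds contained in the duration
    total_seconds = int(ms // 1000)
    # split whole seconds into minutes and leftover seconds
    minutes, seconds = divmod(total_seconds, 60)
    # zero-pad the seconds to two digits
    return f'{minutes}:{seconds:02d}'

print(format_duration(228000))  # 228 s = 3 min 48 s
```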

In [None]:
# convert ms to min:sec
def ms_to_min_sec(ms):
    minutes = int(ms / 60000)
    seconds = int((ms % 60000) / 1000)

    if seconds < 10:
        seconds_str = '0' + str(seconds)
    else:
        seconds_str = str(seconds)

    return str(minutes) + ":" + seconds_str

#### Numerical Summaries


We first find the mean, median, and standard deviation of the `duration_ms` variable to help us understand the central tendencies and the spread of song durations in the dataset. We can find these values using the `agg` function.


In [None]:
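# compute the mean, median, and std of duration_ms, then format each as min:sec
stats = spotify_df['duration_ms'].agg(['mean', 'median', 'std']).apply(ms_to_min_sec)
duration_table = pd.DataFrame({'mean': stats['mean'], 'median': stats['median'], 'std': stats['std']}, index=['duration in min:sec'])
duration_table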

Based on the results above, the average length of top hits is approximately 3 minutes and 48 seconds, based on the `mean` duration. The `median` duration, approximately 3 minutes and 42 seconds, represents the middle value of the durations when they are sorted and reveals the typical duration in the dataset. The `standard deviation`, approximately 39 seconds, measures the variability or spread in song durations.

Next, we organize a DataFrame that displays the `mean` and `median` duration of top hits for each year in the given period. We do this by calculating the `mean` and `median` song duration in milliseconds for each year using the functions `mean()` and `median()` respectively. Then, we convert these time units into minutes and seconds by applying the `ms_to_min_sec()` conversion function. Finally, the resulting DataFrame displays the years and their respective average durations in both mean and median values.

In [None]:
# Organizing a dataframe with the average duration of the top hits from the analyzed period
average_duration = spotify_df[['duration_ms','year']].groupby('year').mean().reset_index()
average_duration['min:sec'] = average_duration['duration_ms'].apply(ms_to_min_sec)
average_duration[['year','min:sec']]

# Organizing a dataframe with the median duration of the top hits from the analyzed period
median_duration = spotify_df[['duration_ms','year']].groupby('year').median().reset_index()
median_duration['min:sec'] = median_duration['duration_ms'].apply(ms_to_min_sec)
median_duration[['year','min:sec']]

# aggregate both mean and median duration
mean_median_duration = pd.merge(average_duration, median_duration, on='year')
mean_median_duration = mean_median_duration.rename(columns={'min:sec_x': 'mean', 'min:sec_y': 'median',
                                                             'duration_ms_x': 'mean_ms', 'duration_ms_y': 'median_ms'})

mean_median_duration.round(1)

Based on the results above, the resulting DataFrame pairs each release `year` with its mean and median song duration, both in milliseconds (`mean_ms`, `median_ms`) and in `"minutes:seconds"` format (`mean`, `median`). It can also be noted that the year 2019 has the shortest average song duration at 3 minutes and 16 seconds, while 2002 has the longest average song duration at 4 minutes and 11 seconds.

The analysis of song durations for the top hits from 2000 to 2019 provides valuable insight into the characteristics of popular music during this time period. On average, the duration of top hit songs is approximately 3 minutes and 48 seconds. While the mean duration provides an idea of the typical length of these songs, the median duration indicates that many songs cluster around 3 minutes and 42 seconds. The standard deviation highlights the variability in song lengths, indicating that there are both shorter and longer songs among the top hits. Furthermore, a chronological examination reveals that some years, such as 2002, had comparatively longer average durations, whereas others, such as 2019, had comparatively shorter ones. These findings provide valuable context for understanding the evolving trends in song durations for the top hits over this two-decade period.
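The per-year mean and median can also be obtained in a single pass, without a merge. The following is a sketch on a made-up dataframe (`sample` is hypothetical) that uses pandas named aggregation; the output column names `mean_ms` and `median_ms` are chosen here to mirror the renamed columns above:

```python
import pandas as pd

# made-up sample, for illustration only
sample = pd.DataFrame({
    'year':        [2000, 2000, 2000, 2001, 2001],
    'duration_ms': [200000, 220000, 300000, 180000, 200000],
})

# one groupby, two aggregations, each written to a named output column
per_year = (sample.groupby('year')['duration_ms']
                  .agg(mean_ms='mean', median_ms='median')
                  .reset_index())
print(per_year)
```

A mean above the median within a year, as in the made-up 2000 group, hints that a few longer songs are pulling the average up.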


#### Data Visualization

*Histogram*

// TODO overall distribution of the duration on a 20-year timeline

In [None]:
# histogram for distribution of song duration over the years
plt.figure(figsize=(12, 5))
ax = sb.histplot(data=spotify_df, x='duration_ms', bins=20, color='pink')

# add vertical line for mean
plt.axvline(x = np.mean(spotify_df.duration_ms), color = 'red', linestyle = '--')
plt.text(x = np.mean(spotify_df.duration_ms) + 0.01, y = 200, s = '3:48 Mean', color = 'red', fontsize = 12)

# add vertical line for median
plt.axvline(x = np.median(spotify_df.duration_ms), color = 'blue', linestyle = '--')
plt.text(x = np.median(spotify_df.duration_ms) + 0.01, y = 300, s = '3:42 Median', color = 'blue', fontsize = 12)

# add labels
plt.title('Distribution of Song Duration')

plt.show()

// TODO post interpretation

*Line Plot*

// TODO mean duration of top hits per year

In [None]:
# line plot of the mean duration of the top hits over the years
plt.figure(figsize=(12, 5))
ax = sb.lineplot(data=spotify_df, x='year', y='duration_ms', color='pink')

# add labels (the y-axis values are in milliseconds)
plt.title('Mean Duration of Top Hits over the Years')
plt.xlabel('Release Year')
plt.ylabel('Mean Duration (ms)')
plt.xticks(np.arange(2000, 2020, 2), rotation=50)

plt.show()

// TODO post interpretation

// Note: the shaded band around the line is the 95% confidence interval of the mean (seaborn's default)
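By default, the band that seaborn draws around a line plot is a 95% confidence interval for the mean of each group, estimated by bootstrapping. The sketch below illustrates the general percentile bootstrap on made-up durations; it is not seaborn's internal code, and the helper name `bootstrap_ci`, its parameters, and the sample values are all assumptions made for illustration:

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile-bootstrap confidence interval for the mean (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # draw n_boot resamples (with replacement), each the same size as the data
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # cut off equal tails on both sides of the resampled means
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

# made-up durations (ms) for a single year, for illustration only
durations = np.array([200000, 210000, 215000, 230000, 250000, 260000], dtype=float)
low, high = bootstrap_ci(durations)
print(low, high)
```

Years with fewer songs or more varied durations would produce a wider interval, which is why the band's width changes along the line.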

*Boxplot*

// TODO median duration of top hits PER YEAR
// Reportedly, a violin plot is basically a box plot and a histogram combined, so we can easily see both the central tendency (median) and the spread (histogram/std)

In [None]:
# TODO box plot for distribution of average song duration over the years
plt.figure(figsize=(12, 5))
ax = sb.boxplot(data=spotify_df, x='year', y='duration_ms', color='pink')

# add labels
plt.title('Distribution of Song Duration Over the Years')
plt.xlabel('Release Year')
plt.ylabel('Duration (ms)')
plt.xticks(rotation = 50)

plt.show()

// TODO post interpretation

### Question 3: What is (a) the distribution of the songs according to their level of instrumentalness and (b) the average instrumentalness level per year?

Similar to the previous question, the third question also revolves around the relationship between the instrumentalness level of each song, represented by `instrumentalness`, and its corresponding release year, represented by `year`. This specific question observes the distribution of the songs according to instrumentalness, and the average level of instrumentalness per year.

#### Numerical Summaries

To obtain the mean, median, and standard deviation of the `instrumentalness` variable in the dataset, we can use the `agg` function. The `mean` refers to the average of the instrumentalness values observed in the songs. The `median` represents the middle value that indicates the central tendency of this particular feature. Lastly, the `std` denotes the standard deviation, which measures the extent of dispersion or variability in instrumentalness among the songs. 


In [None]:
# get the mean median and std
spotify_df.agg({'instrumentalness': ['mean', 'median', 'std']})



Based on the above results, the `mean` instrumentalness value is approximately 0.0155. This suggests that, on average, the top hit songs tend to be primarily non-instrumental. The `median` instrumentalness value is 0.0000. This finding is intriguing, as it indicates that at least half of the top hit songs in our dataset have an instrumentalness of zero, meaning they are predicted to contain vocals. These songs are essentially vocal-driven, which may contribute to their broad appeal and accessibility. The `standard deviation` of approximately 0.0890 reveals some variability in the instrumentalness of these songs. While the mean and median values point to a preference for non-instrumental tracks, the standard deviation indicates that there are exceptions. Some songs may have higher instrumentalness, but they appear to be less common among top hits.
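A median of exactly zero alongside a positive mean indicates a distribution piled up at zero with a thin right tail. One way to quantify this is the share of values that are exactly zero; the sketch below uses a made-up Series (`sample` is hypothetical, not the real column):

```python
import pandas as pd

# made-up instrumentalness values, for illustration only
sample = pd.Series([0.0, 0.0, 0.0, 0.00002, 0.45], name='instrumentalness')

# the mean of a boolean mask is the fraction of True values
share_zero = (sample == 0).mean()

print(share_zero, sample.median(), sample.mean())
```

Whenever more than half of the values are zero, the median must be zero as well, while a handful of large values is enough to lift the mean above it.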

Here, we aim to calculate the average `instrumentalness` rate for top hit songs over the years. The code extracts the `instrumentalness` and `year` columns from the dataset and then groups the data by the `year` to calculate the mean instrumentalness for each year.

In [None]:
# TODO average instrumentalness rate per year
instrumentalness = spotify_df[['instrumentalness','year']].groupby('year').mean().reset_index()
instrumentalness.round(2)

// TODO post interpretation

#### Data Visualization

*Histogram*

Histograms are utilized to represent the distribution of the `instrumentalness` variable. The `histplot` function can accomplish this with ease, representing the instrumentalness levels as the bars in `"pink"`. 

In [None]:
# generate the histogram
plt.figure(figsize=(12, 5))
ax = sb.histplot(data = spotify_df, x = 'instrumentalness', bins = 30, color = 'pink')

# add vertical line for mean
plt.axvline(x = np.mean(spotify_df.instrumentalness), color = 'red', linestyle = '--')
plt.text(x = np.mean(spotify_df.instrumentalness) + 0.01, y = 200, s = '0.02 Mean', color = 'red', fontsize = 12)

# add vertical line for median
plt.axvline(x = np.median(spotify_df.instrumentalness), color = 'blue', linestyle = '--')
plt.text(x = np.median(spotify_df.instrumentalness) + 0.01, y = 400, s = '0.00 Median', color = 'blue', fontsize = 12)

# add labels
plt.title('Instrumentalness Distribution of Top Hit Songs')
plt.xlabel('Instrumentalness')
plt.xticks(np.arange(0, 1.1, 0.1))
plt.ylabel('Number of Top Hits')

plt.show()


// TODO post interpretation

*Line Plot*

// TODO average instrumentalness of top hits PER YEAR

In [None]:
# TODO Line plot
plt.figure(figsize=(12, 5))
ax = sb.lineplot(data = spotify_df, x = 'year', y = 'instrumentalness', color = 'pink')

# add labels
plt.title('Instrumentalness of Top Hits over the Years')
plt.xlabel('Release Year')
plt.ylabel('Instrumentalness')
plt.xticks(np.arange(2000, 2020, 2), rotation=50)

plt.show()

// TODO Post interpretation

*Box Plot*

// TODO

In [None]:
# TODO box plot
plt.figure(figsize=(12, 5))
ax = sb.boxplot(data=spotify_df, x='year', y='instrumentalness', color='pink')

# add labels
plt.title('Distribution of Instrumentalness Over the Years')
plt.xlabel('Release Year')
plt.ylabel('Instrumentalness')
plt.xticks(rotation = 50)

plt.show()

// post interpretation

### Question 4: What is (a) the distribution of the songs according to their level of tempo and (b) the average tempo level per year?

The current EDA question delves into the distribution of songs according to their tempo, and the average level of tempo per year, similar to the previous questions. To do this, we focus on the `year` and `tempo` variables.

#### Numerical Summaries

We will once again use the `agg` function to obtain the mean, median, and standard deviation of the tempo levels across the entire period covered by the dataframe.

In [None]:
# get the mean median and std
spotify_df.agg({'tempo': ['mean', 'median', 'std']})


// TODO post-interpretation

// TODO average tempo per year

In [None]:
# TODO average tempo per year
tempo = spotify_df[['tempo','year']].groupby('year').mean().reset_index()
tempo.round(2)

// TODO post interpretation

#### Data Visualization

*Histogram*

We once again visualize the distribution of `tempo` using a histogram, since its continuous (decimal) values are best grouped into ranges. To accomplish this, we use the `histplot` function for the visualization and mark the mean and median of the data with vertical lines.

In [None]:
# generate the histogram
plt.figure(figsize=(12, 5))
ax = sb.histplot(data = spotify_df, x = 'tempo', bins = 30, color = 'pink')

# add vertical line for mean
plt.axvline(x = np.mean(spotify_df.tempo), color = 'red', linestyle = '--')
plt.text(x = np.mean(spotify_df.tempo) + 0.01, y = 200, s = '120.12 bpm Mean', color = 'red', fontsize = 12)

# add vertical line for median
plt.axvline(x = np.median(spotify_df.tempo), color = 'blue', linestyle = '--')
plt.text(x = np.median(spotify_df.tempo) + 0.01, y = 100, s = '120.03 bpm Median', color = 'blue', fontsize = 12)

# add labels
plt.title('Tempo Distribution of Top Hit Songs')
plt.xlabel('Tempo (BPM)')
plt.ylabel('Number of Top Hits')

plt.show()


// TODO post interpretation

*Line Plot*

// TODO

In [None]:
# TODO code
plt.figure(figsize=(12, 5))
ax = sb.lineplot(data = spotify_df, x = 'year', y = 'tempo', color = 'pink')

# add labels
plt.title('Tempo of Top Hits over the Years')
plt.xlabel('Release Year')
plt.ylabel('Tempo')
plt.xticks(np.arange(2000, 2020, 2), rotation=50)

plt.show()

// TODO post interpretation

*Box Plot*

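We use a box plot to compare the distribution of `tempo` within each release year, showing the median, quartiles, and outliers per year.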

In [None]:
# generate the box plot
plt.figure(figsize=(12, 5))
ax = sb.boxplot(data=spotify_df, x='year', y='tempo', color='pink')

# add labels
plt.title('Distribution of Tempo Over the Years')
plt.xlabel('Release Year')
plt.ylabel('Tempo')
plt.xticks(rotation = 50)

plt.show()

// TODO post interpretation
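
To help read the box plot, recall that each box spans the first to third quartile, and that with seaborn's default setting (`whis=1.5`) points farther than 1.5 times the interquartile range from the box are drawn as outliers. The sketch below computes those quantities by hand, assuming a made-up `tempo` series for a single year; the values are illustrative and not from the dataset.

```python
import pandas as pd

# Hypothetical tempo values for one release year (made up for illustration).
tempo = pd.Series([80.0, 100.0, 110.0, 120.0, 130.0, 140.0, 210.0])

# Quartiles and interquartile range (the height of the box).
q1 = tempo.quantile(0.25)
q3 = tempo.quantile(0.75)
iqr = q3 - q1

# Fences: values outside this range are plotted as individual outlier points.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = tempo[(tempo < lower_fence) | (tempo > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence})")
print("outliers:", outliers.tolist())
```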

### Question 5: What is the relationship between the tracks' duration, instrumentalness, and tempo with the release year?

Lastly, to find the general relationship between each of the variables utilized (`duration_ms`, `instrumentalness`, `tempo`) and the `year` variable, we must compute their correlation values.

#### Numerical Summaries

We can use the `.corr()` method of a pandas dataframe to easily obtain the correlation values. By default, it computes the pairwise Pearson correlation coefficient between the columns.

In [None]:
# select the variables of interest and compute their pairwise correlations
instru_tempo_year = spotify_df[['year', 'duration_ms', 'instrumentalness', 'tempo']]
corr = instru_tempo_year.corr()
corr
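
To make the correlation values less of a black box, the sketch below computes the Pearson coefficient by hand and compares it with the result of `.corr()`, whose default method is Pearson. It assumes a made-up dataframe (`toy_df`) and a hypothetical helper (`pearson_r`); neither is part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names mirror the dataset, values are made up.
toy_df = pd.DataFrame({
    "year": [2000, 2005, 2010, 2015, 2019],
    "tempo": [118.0, 121.5, 119.0, 123.0, 120.5],
})


def pearson_r(x, y) -> float:
    """Pearson r: sum of co-deviations divided by the product of deviation norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return float((x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum()))


manual_r = pearson_r(toy_df["year"], toy_df["tempo"])
pandas_r = toy_df.corr().loc["year", "tempo"]

print(round(manual_r, 4), round(pandas_r, 4))
```

Both numbers should agree up to floating-point error. A value near 0 suggests little linear relationship with `year`, while values near -1 or 1 suggest a strong one.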

#### Data Visualization

*Scatterplot*

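We use scatter plots with a fitted regression line, generated with the `lmplot` function, to visualize how each variable relates to the release year.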

In [None]:
# scatter plot of year vs instrumentalness
ax = sb.lmplot(data = spotify_df, x = 'year', y = 'instrumentalness', scatter_kws = {'s': 100, 'alpha': 0.3, 'color': 'pink'}, line_kws = {'color': 'purple'})

# adjust size
ax.figure.set_size_inches(12, 8)

# add labels
plt.title('Instrumentalness of Top Hits over the Years')
plt.xlabel('Release Year')
plt.ylabel('Instrumentalness')
plt.xticks(np.arange(2000, 2020, 2), rotation=50)

plt.show()

In [None]:
# scatter plot of year vs tempo
ax = sb.lmplot(data = spotify_df, x = 'year', y = 'tempo', scatter_kws = {'s': 100, 'alpha': 0.3, 'color': 'pink'}, line_kws = {'color': 'purple'})

# adjust size
ax.figure.set_size_inches(12, 8)

# add labels
plt.title('Tempo of Top Hits over the Years')
plt.xlabel('Release Year')
plt.ylabel('Tempo')
plt.xticks(np.arange(2000, 2020, 2), rotation=50)

plt.show()

In [None]:
# scatter plot of year vs duration
ax = sb.lmplot(data = spotify_df, x = 'year', y = 'duration_ms', scatter_kws = {'s': 100, 'alpha': 0.3, 'color': 'pink'}, line_kws = {'color': 'purple'})

# adjust size
ax.figure.set_size_inches(12, 8)

# add labels
plt.title('Duration of Top Hits over the Years')
plt.xlabel('Release Year')
plt.ylabel('Duration')
plt.xticks(np.arange(2000, 2020, 2), rotation=50)

plt.show()

// TODO post interpretation

*Heatmap*

After computing the correlation matrix, we can visualize it with seaborn as a correlation heatmap for easy analysis, with the help of the `subplots`, `triu`, `diverging_palette`, and `heatmap` functions.

In [None]:
# Heatmap Design
f, ax = plt.subplots(figsize = (12, 10))
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sb.diverging_palette(230, 20, as_cmap=True)

sb.heatmap(corr, annot=True, cmap=cmap, mask=mask)

plt.show()
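
The `mask` passed to `heatmap` hides every cell where it is `True`. Because a correlation matrix is symmetric, masking the upper triangle (including the diagonal of 1s) leaves each pair of variables visible exactly once. The sketch below illustrates this on a placeholder matrix; `toy_corr` is an identity matrix used only for its shape and labels, not real correlation values.

```python
import numpy as np
import pandas as pd

cols = ["year", "duration_ms", "instrumentalness", "tempo"]

# Placeholder matrix (identity) used only for its shape and labels.
toy_corr = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)

# True on and above the diagonal; these cells would be hidden in the heatmap.
mask = np.triu(np.ones_like(toy_corr.to_numpy(), dtype=bool))

# The (row, column) pairs that remain visible: the strict lower triangle.
visible_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(len(cols))
    if not mask[i, j]
]

print(mask)
print(visible_pairs)
```

With four variables, 10 of the 16 cells are hidden and 6 remain, one for each distinct pair.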

# TODO include duration in the correlation matrix

## Research Question

In the Exploratory Data Analysis portion of this notebook, the questions were heavily oriented towards understanding the relationship between the numerical variables in the dataset and a song's release year, given by the `year` variable. The goal of the EDA was to determine whether or not a trend was present in these relationships. By focusing on the numerical values present in the dataset, we were able to visualize and confirm that the data could be used in a predictive manner.

As such, after investigating and answering the questions stated in the Exploratory Data Analysis portion of the notebook, we formulated the following research question:

---
### Can we cluster and determine when a song is/was released depending on the following variables: (1) duration, (2) instrumentalness, and (3) tempo?
---

### Rationale for Research Question based on Exploratory Data Analysis

write me

// identify opportunities for clustering among our variables
// expound on features included in rq: duration/instru/tempo
// findings: inferences made 

### Significance of Research Question

Answering this research question could help identify how music has evolved over time and how it reflects the musical trends of each period. Doing so would enable people to gain a deeper and more meaningful understanding of the historical and cultural context of music. The study may also improve current predictive measures and further expand their applications in technological and musical areas such as artificial intelligence prediction, music recommendation systems, and music generation.