# CSMODEL Case Study - Phase 1

### Members:
- Angelo Guerra
- Adrian Yung Cheng
- Alina Sayo
- Mark Daniel Gutierrez

**Section**: S15

**Instructor**: Mr. Gabriel Avelino Sampedro

In this notebook, we will be using the **[Spotify Top Hits from 2000-2019](https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019/data)** Dataset. The notebook will cover an analysis of the raw dataset and various processes to extract meaningful insights and conclusions from the data.


## Importing Libraries

First, import the libraries needed to perform data operations throughout this notebook:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

plt.style.use("seaborn-v0_8-darkgrid")

* **NumPy** - NumPy is a Python library for working with arrays, with functions for linear algebra, Fourier transforms, and matrices.
* **Pandas** - Pandas is a Python library for data manipulation and data analysis.
* **Matplotlib** - Matplotlib is a Python library for data visualization that allows us to easily render various types of graphs.
* **Seaborn** - Seaborn is a Python library for data visualization built on top of Matplotlib. It is designed to create attractive and informative statistical graphics, and makes complex visualizations easier to produce than with Matplotlib alone.

## Dataset Description

### Brief Description

The dataset used throughout this notebook consists of a `.csv` file containing audio statistics of the top 2000 tracks from 2000-2019 on Spotify, a global audio streaming service. The data contains information about each track and its qualities, including the song artist, year it was released, popularity rate, and various characteristics. The dataset takes advantage of Spotify's huge collection of music to create a useful resource which could help those who want to statistically assess and analyze the platform's top hits from the past two decades.

### Collection Process

The **[Top Hits Spotify 2000-2019](https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019/data)** dataset was posted on Kaggle by user Mark Koverha and draws from the catalog of Spotify, a popular digital music streaming service. Spotify has created playlists for the top hits of each year, and Koverha extracted the data from these playlists using the `Spotipy` library for Python.

Spotify currently provides a publicly available API. The Spotify Web API and the `Spotipy` Python library were used to gather the track information that makes up the dataset.

### Dataset File Structure

Each entry in the dataset is a top hit song released from 2000 to 2019. Each column represents a single attribute of the song that differentiates it from other entries, such as its artist and title, for a total of eighteen (18) variables.

The `read_csv()` function of the pandas library loads the dataset into a DataFrame, which is assigned to a variable. The `info()` function then displays general information about the dataset, such as its columns, non-null counts, and datatypes.
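As a minimal sketch of what these two functions do, the example below reads a small in-memory CSV instead of the real file. The three rows and the name `sample_df` are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt; every value here is made up.
csv_text = """artist,song,duration_ms,year,tempo
Artist A,Song A,211160,2000,95.053
Artist B,Song B,167066,2005,148.726
Artist C,Song C,250546,2019,136.859
"""

# read_csv() accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

# info() prints the column names, non-null counts, and datatypes.
sample_df.info()
```

Reading from `io.StringIO` keeps the sketch self-contained; with the actual dataset, only the argument of `read_csv()` changes.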

In [None]:
spotify_df = pd.read_csv('songs_normalize.csv')
spotify_df.info()

// add post interpretation on df.info
The `shape` attribute returns the dimensions of the dataset as a tuple containing the number of rows and the number of columns.

In [None]:
spotify_df.shape

// add post interpretation on shape TODO

We can then use the `describe()` function to produce a statistical description of the numerical variables in the dataset, including the count, mean, standard deviation, minimum, maximum, and quartiles.
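The behavior of `describe()` can be sketched on a tiny frame first. The frame `toy_df` and its values below are hypothetical; the point is only that text columns are left out of the summary by default.

```python
import pandas as pd

# Hypothetical data; the values are made up for illustration.
toy_df = pd.DataFrame({
    "song": ["Song A", "Song B", "Song C", "Song D"],
    "duration_ms": [200000, 220000, 240000, 260000],
    "tempo": [90.0, 100.0, 110.0, 120.0],
})

# By default, describe() summarizes only the numerical columns,
# so the text column "song" does not appear in the result.
summary = toy_df.describe()
print(summary.loc[["count", "mean", "std", "min", "50%", "max"]])
```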

In [None]:
spotify_df.describe()

By calling the `head()` function of the pandas library, we can display the first 5 rows of the dataset.

In [None]:
spotify_df.head()

To display the last 5 rows of the dataset, the `tail()` function will be used.

In [None]:
spotify_df.tail()

### Variables

Each variable in the dataset describes one attribute of a track. The following are the variables in the dataset:

| Variable         | Description | Datatype |
|------------------|-------------|----------|
| Artist           | The track’s performing artist.        | Object  |
| Title            | The track’s title (stored in the `song` column). | Object  |
| Duration_ms      | The track’s duration in milliseconds. | Integer |
| Explicit         | Indicates whether the track contains explicit content. | Boolean     |
| Year             | The track’s release year.             | Integer |
| Popularity       | A measure of the popularity of the song. Ranges from 0 (least popular) to 100 (most popular). | Integer |
| Danceability     | A measure of how suitable a song is for dancing, based on its beat strength, stability, and tempo. Ranges from 0 (least danceable) to 1 (most danceable). | Float |
| Energy           | A measure of the track’s intensity and activity. Ranges from 0 (least energy) to 1 (most energy). | Float |
| Key              | The music key of the track, represented as integers using standard Pitch Class notation. | Integer |
| Loudness         | A measure of the overall loudness of the track in decibels (dB), averaged across the entire track. | Float |
| Mode             | Indicates whether the track is in a major (1) or minor (0) key. | Integer |
| Speechiness      | Measures the presence of spoken words in the track. Values closer to 1.0 indicate more speech content, while values closer to 0.0 most likely represent music and other non-speech-like tracks. | Float |
| Acousticness     | A confidence measure of whether the track is acoustic, with 1.0 representing high confidence that the track is acoustic. | Float |
| Instrumentalness | Predicts whether the track contains no vocals, with values closer to 1.0 indicating a higher likelihood of no vocal content. | Float |
| Liveness         | Detects the presence of an audience in the recording, with higher values suggesting a live performance. | Float |
| Valence          | A measure of the musical positiveness of the track, ranging from 0.0 (negative) to 1.0 (positive). Tracks with high valence sound more positive, and tracks with low valence sound more negative. | Float |
| Tempo            | The estimated tempo of the track in beats per minute (BPM). | Float |
| Genre            | The genre of the track.                | Object |
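The ranges documented in the table can be checked programmatically. The sketch below is one possible way to do so, assuming lowercase column names as used in the code cells of this notebook; `toy_df`, its values, and the helper dictionary `expected_ranges` are hypothetical, and two values are deliberately out of range.

```python
import pandas as pd

# Hypothetical rows; the values are made up, and two are deliberately invalid.
toy_df = pd.DataFrame({
    "popularity": [77, 0, 101],
    "energy": [0.83, 0.45, 0.91],
    "instrumentalness": [0.0, 0.12, 1.5],
    "mode": [1, 0, 1],
})

# Inclusive bounds taken from the variable descriptions in the table.
expected_ranges = {
    "popularity": (0, 100),
    "energy": (0.0, 1.0),
    "instrumentalness": (0.0, 1.0),
    "mode": (0, 1),
}

# Count the values that fall outside the documented range of each column.
out_of_range = {
    col: int((~toy_df[col].between(low, high)).sum())
    for col, (low, high) in expected_ranges.items()
}
print(out_of_range)
```

A count of zero for every column would indicate that all values respect the documented ranges.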

## Data Cleaning

### Removing Unused Variables

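Variables that are not needed for the analysis in this notebook are removed using the `drop()` function of the pandas library, where `axis = 1` indicates that columns (rather than rows) are dropped.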

In [None]:
spotify_df = spotify_df.drop(['explicit', 'danceability', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'liveness', 'valence', 'genre'], axis = 1)
spotify_df.info()

### Check for Multiple Representations

To check whether or not the dataset contains multiple representations of values per variable, we can call the `unique()` function of the pandas library.

In [None]:
for col in spotify_df:
    print("'{}' column's unique values:\n".format(col), spotify_df[col].unique())
    print("")

According to the results displayed by the function, none of the variables contain multiple representations of the same value.
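Scanning long lists of unique values by eye can miss near-identical entries. One possible complement, sketched here on made-up values, is to compare the number of unique values before and after normalizing whitespace and letter case; the series `artists` is hypothetical.

```python
import pandas as pd

# Hypothetical artist values (made up): the first three denote the same artist.
artists = pd.Series(["Artist A", "artist a", "Artist A ", "Artist B"])

# Number of distinct values as stored.
raw_unique = artists.nunique()

# Number of distinct values after trimming whitespace and lowercasing.
normalized_unique = artists.str.strip().str.lower().nunique()

print(raw_unique, normalized_unique)
```

If the two counts differ, at least one value is stored under more than one representation.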

### Check for Incorrect Datatypes

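To check whether each variable has the appropriate datatype, we can call the `info()` function again, which lists the datatype of every column.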

In [None]:
spotify_df.info()

Based on the datatypes displayed by the function, all of the variables have the proper corresponding datatype assigned to them.
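Had a numerical column been read as text, it could be converted with `pd.to_numeric()`. The sketch below is illustrative only: `toy_df` and the bad value `"n/a"` are made up and do not come from the dataset.

```python
import pandas as pd

# Hypothetical column in which the years were read as text.
toy_df = pd.DataFrame({"year": ["2001", "2015", "n/a"]})
print(toy_df["year"].dtype)  # a text datatype

# errors="coerce" turns unparseable entries into NaN instead of raising an error.
toy_df["year"] = pd.to_numeric(toy_df["year"], errors="coerce")
print(toy_df["year"].dtype)  # float64, because NaN is a floating-point value
```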

### Check for Default Values

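To check for default or placeholder values, we can inspect the unique values of each variable again and look for values that fall outside the valid range of the variable.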

In [None]:
for col in spotify_df:
    print("'{}' column's unique values:\n".format(col), spotify_df[col].unique())
    print("")

According to the results provided by the function, only the year variable contained values outside its valid range (2000 to 2019). These records are removed by keeping only the rows whose year falls within this range.

In [None]:
spotify_df = spotify_df[(spotify_df['year'] >= 2000) & (spotify_df['year'] <= 2019)]
spotify_df.info()

In [None]:
spotify_df.shape

After dropping the invalid records, the shape of the updated dataset shows that **1958 entries** remain. This means that **42 invalid entries** were removed from the initial 2000 entries of the dataset.
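The same kind of filter, together with the count of removed rows, can be sketched on made-up data. Here `Series.between()` is used as an equivalent way of writing the two comparisons; `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows; two of the years fall outside 2000-2019.
toy_df = pd.DataFrame({
    "song": ["Song A", "Song B", "Song C", "Song D", "Song E"],
    "year": [1998, 2000, 2010, 2019, 2020],
})

rows_before = len(toy_df)

# between() is inclusive on both ends by default.
filtered_df = toy_df[toy_df["year"].between(2000, 2019)]

# The number of removed rows is the difference in length.
rows_removed = rows_before - len(filtered_df)
print(rows_removed)
```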

### Check for Missing Data

To check for missing data within the dataset, we can use the `isnull()` function together with `any()` to determine whether each variable contains null values.

In [None]:
spotify_df.isnull().any()

Based on the results provided by the function, none of the variables contain null values.
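While `any()` only reports whether a column has at least one null value, `sum()` reports how many. A minimal sketch on made-up data, where `toy_df` is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one missing title and one missing tempo.
toy_df = pd.DataFrame({
    "song": ["Song A", None, "Song C"],
    "tempo": [95.0, 120.5, np.nan],
    "year": [2001, 2005, 2010],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)
```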

### Check for Duplicate Data

To check for duplicated data within the dataset, we can use the `duplicated()` function to identify whether or not there are duplicate records.

In [None]:
spotify_df.duplicated().any()

Based on the result of the function, the dataset contains duplicate records. We can use the `drop_duplicates()` function to remove these records.

In [None]:
spotify_df = spotify_df.drop_duplicates()

In [None]:
spotify_df.shape

After removing the duplicate records, the dataset contains **1899 unique records**. This means that **59 duplicate records** were present among the 1958 entries that remained after the previous step.
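The number of duplicates can also be counted directly before dropping them. The sketch below uses made-up data and additionally shows the `subset` parameter, which restricts the comparison to selected columns; whether artist and title alone should identify a duplicate in this dataset is an assumption, not something established above.

```python
import pandas as pd

# Hypothetical rows: rows 0 and 1 are identical, row 3 differs only in year.
toy_df = pd.DataFrame({
    "artist": ["Artist A", "Artist A", "Artist B", "Artist A"],
    "song": ["Song A", "Song A", "Song B", "Song A"],
    "year": [2001, 2001, 2005, 2002],
})

# Rows that repeat an earlier row in every column.
n_duplicates = int(toy_df.duplicated().sum())

# Dropping them keeps the first occurrence of each record.
deduplicated_df = toy_df.drop_duplicates()

# Rows that repeat an earlier row in artist and song only.
n_same_song = int(toy_df.duplicated(subset=["artist", "song"]).sum())

print(n_duplicates, len(deduplicated_df), n_same_song)
```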

### Checking for Inconsistent Formatting

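To check for inconsistent formatting, we can inspect the unique values of each variable once more and look for values that follow a different format from the rest of the variable.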

In [None]:
for col in spotify_df:
    print("'{}' column's unique values:\n".format(col), spotify_df[col].unique())
    print("")

Upon checking the results, the dataset does not appear to contain any inconsistencies in its formatting.

## Exploratory Data Analysis

1. What is the distribution of songs according to release year?

2. What is the distribution of songs according to duration?

3. What is the distribution of the songs according to their level of instrumentalness?

4. What is the distribution of the songs according to their level of tempo?

5. What are the relationships between duration, instrumentalness, and tempo with the release year?

### Question 1: What is the distribution of top hits according to release year?

// What variables will be used?



#### Numerical Summaries

// write me

In [None]:
spotify_df.groupby('year').agg({'song': ['count']})

#### Data Visualization

*Bar Plot*

// write me

In [None]:
# count the number of songs per year once, then draw the bar plot
songs_per_year = spotify_df.groupby('year')['song'].count()
plt.bar(songs_per_year.index, songs_per_year.values, color = 'pink')

# add labels
plt.title('Number of Top Hit Songs per Year')
plt.xlabel('Release Year')
plt.ylabel('Number of Top Hits')
plt.xticks(np.arange(2000, 2020, 2), rotation = 50)

plt.show()


// post interpretation
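The per-year counts can also be obtained with `value_counts()`. The sketch below, on made-up data, shows that it matches the `groupby` count and how `reindex()` can make years without any songs appear with a count of zero; `toy_df` and the year range are hypothetical.

```python
import pandas as pd

# Hypothetical rows; note that no song was released in 2002.
toy_df = pd.DataFrame({
    "song": ["Song A", "Song B", "Song C", "Song D", "Song E"],
    "year": [2000, 2000, 2001, 2003, 2003],
})

# Count of songs per year using groupby.
counts_groupby = toy_df.groupby("year")["song"].count()

# Equivalent count using value_counts(), sorted by year.
counts_value = toy_df["year"].value_counts().sort_index()

# Fill in years that have no songs so they appear with a count of 0.
counts_full = counts_value.reindex(range(2000, 2004), fill_value=0)
print(counts_full)
```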

### Question 2: What is the average duration of top hits per year and 

// write me

#### Numerical Summaries

// mean of duration per year

In [None]:
# convert a duration in milliseconds to a "minutes:seconds" string
def ms_to_min_sec(ms):
    minutes, seconds = divmod(int(ms // 1000), 60)
    # zero-pad the seconds so that 3 min 5 s reads "3:05" instead of "3:5"
    return f"{minutes}:{seconds:02d}"

# average duration of the top hits per release year
average_duration = spotify_df[['duration_ms','year']].groupby('year').mean().reset_index()
average_duration['min:sec'] = average_duration['duration_ms'].apply(ms_to_min_sec)
# print explicitly, since only the last expression of a cell is displayed automatically
print(average_duration[['year','min:sec']])

# add the formatted duration of each individual song to the dataset
spotify_df['min:sec'] = spotify_df['duration_ms'].apply(ms_to_min_sec)

# show
spotify_df

// post interpretation

#### Data Visualization

*Scatterplot*

write me
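One possible shape for this scatterplot is sketched below: the average duration per year, converted to minutes, plotted against the release year. The frame `toy_df` and its values are made up and do not reflect the dataset, so no conclusion should be drawn from the resulting figure.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows; the durations are made up for illustration.
toy_df = pd.DataFrame({
    "year": [2000, 2000, 2010, 2010, 2019, 2019],
    "duration_ms": [240000, 260000, 220000, 230000, 190000, 200000],
})

# Average duration per year, converted from milliseconds to minutes.
avg_minutes = toy_df.groupby("year")["duration_ms"].mean() / 60000

# One point per release year.
fig, ax = plt.subplots()
ax.scatter(avg_minutes.index, avg_minutes.values)
ax.set_title("Average Duration of Top Hits per Year")
ax.set_xlabel("Release Year")
ax.set_ylabel("Average Duration (minutes)")

plt.show()
```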

// post interpretation

### Question 3: What is the distribution of the songs according to their level of instrumentalness?

// write me

#### Numerical Summaries

// mean median std
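These three statistics can be computed in a single call with `agg()`. The sketch below uses made-up values in which most tracks are close to 0 and one is close to 1; this is an assumption chosen to show how the mean and median can diverge, not a finding about the dataset.

```python
import pandas as pd

# Hypothetical instrumentalness values (made up for illustration).
toy_df = pd.DataFrame({"instrumentalness": [0.0, 0.0, 0.001, 0.02, 0.9]})

# agg() with a list of names returns one value per statistic.
summary = toy_df["instrumentalness"].agg(["mean", "median", "std"])
print(summary)
```

When the mean is far above the median, as with these made-up values, the distribution is right-skewed, which is worth checking before interpreting the mean alone.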

post interpretation

#### Data Visualization

*Histogram*

write me

### Question 4: What is the distribution of the songs according to their level of tempo?

#### Numerical Summaries

// write me

#### Data Visualization

*Histogram*

// write me

post interpretation

### Question 5: What are the relationships between instrumentalness and tempo with the release year?

#### Numerical Summaries
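One candidate numerical summary for this question is the Pearson correlation of each variable with the release year, which `corr()` computes by default. The sketch below uses made-up values (the durations are deliberately perfectly linear in the year), so the printed coefficients say nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical rows; duration_ms decreases by a fixed amount every four years.
toy_df = pd.DataFrame({
    "year": [2000, 2004, 2008, 2012, 2016],
    "duration_ms": [250000, 240000, 230000, 220000, 210000],
    "instrumentalness": [0.00, 0.02, 0.01, 0.03, 0.00],
    "tempo": [120.0, 98.5, 130.2, 110.0, 125.4],
})

# Pearson correlation of every variable with the release year.
correlations = toy_df.corr()["year"].drop("year")
print(correlations)
```

Pearson correlation only captures linear relationships, so a value near zero does not rule out a non-linear pattern across the years.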

#### Data Visualization

## Research Question

---
### Can we cluster and determine when a song is/was released depending on the following variables: (1) duration, (2) instrumentalness, and (3) tempo?
---

### Rationale for Research Question based on EDA

### Significance of Research Question