## **Spotify Analysis**

## **Group Members:** 
* Antonia Sunseri 
* Manith Kumarapeli

## **Problem Statement**

Our project investigates temporal trends in the duration of top-hit songs from 2000 to 2019, using the "Top Hits Spotify from 2000–2019" dataset available on Kaggle (https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019).

Our project will address potential cultural and economic factors influencing song popularity, including the shortening of attention spans in the digital age and the influence of streaming platforms, which may favor shorter tracks to maximize play counts and revenue. By examining changes in average song duration over time, this analysis aims to uncover patterns that reflect shifts in music consumption and popularity trends.
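
The average song duration per year mentioned above could be computed along these lines. This is only a minimal sketch: the `sample` rows are made-up placeholder values rather than rows from the dataset, and the `duration_s` column is a name we introduce for illustration. In the real analysis the same steps would run on the data frame loaded from `songs_normalize.csv`.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset; the values are made up.
sample = pd.DataFrame({
    "year": [2000, 2000, 2010, 2010, 2019],
    "duration_ms": [240000, 260000, 220000, 230000, 190000],
})

# Convert milliseconds to seconds so the averages are easier to read.
sample["duration_s"] = sample["duration_ms"] / 1000

# Average duration for each release year.
avg_duration_by_year = sample.groupby("year")["duration_s"].mean()
print(avg_duration_by_year)
```

A downward slope in this series across years would be consistent with the hypothesis that top hits are getting shorter, although it would not by itself establish the cause.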

Additionally, we will analyze an artist's contribution to a song's popularity: for example, whether having a contributing artist makes a song more likely to be popular, or whether a remix of an original song, a radio edit, or a genre change with additional artists strengthens a song's popularity. We will also investigate how music genres have evolved over time, including whether earlier generations tended to prefer certain genres.
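
One possible way to look at genre evolution is to count how many songs carry each genre label per year. The sketch below assumes that the `genre` column can hold several labels in one comma-separated string (for example `"rock, pop"`); that format is an assumption here, and the `sample` rows are made-up placeholders, not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the comma-separated genre format is an assumption.
sample = pd.DataFrame({
    "year": [2000, 2000, 2019],
    "genre": ["pop", "rock, pop", "hip hop, pop"],
})

# Split multi-genre strings into lists, then give each genre its own row.
genres = sample.assign(genre=sample["genre"].str.split(", ")).explode("genre")

# Count songs per year and genre: rows are years, columns are genres.
genre_counts = pd.crosstab(genres["year"], genres["genre"])
print(genre_counts)
```

A song with several labels is counted once under each of them, so the row totals can exceed the number of songs in that year.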

## **Analysis & Wrangling Methods**

### **Wrangling Methods** 

* Checking for missing values
* Checking for duplicates
* Outlier detection & removal

In [None]:
# Checking for missing values 
import pandas as pd

dataframe_1 = pd.read_csv('songs_normalize.csv')
dataframe_1.isnull().sum().sort_values(ascending=False)

artist              0
song                0
tempo               0
valence             0
liveness            0
instrumentalness    0
acousticness        0
speechiness         0
mode                0
loudness            0
key                 0
energy              0
danceability        0
popularity          0
year                0
explicit            0
duration_ms         0
genre               0
dtype: int64

No missing values are present in the dataset, so no imputation is needed.

In [3]:
# Removing duplicate rows

dataframe_filtered = dataframe_1.drop_duplicates()
# Number of rows dropped as exact duplicates
duplicate_rows = len(dataframe_1) - len(dataframe_filtered)
print(duplicate_rows)

59


The dataset contains 59 duplicate rows. If a machine learning algorithm were applied, these rows could skew the results and introduce bias, so the deduplicated data is kept in `dataframe_filtered`.

In [6]:
# Skim through the data: print the unique values in each column
for column in dataframe_1.columns:
    print(f"\nColumn: {column}")
    print(dataframe_1[column].unique())

# Check the data type pandas inferred for each column
print(dataframe_1.dtypes)


Column: artist
['Britney Spears' 'blink-182' 'Faith Hill' 'Bon Jovi' '*NSYNC' 'Sisqo'
 'Eminem' 'Robbie Williams' "Destiny's Child" 'Modjo' "Gigi D'Agostino"
 'Eiffel 65' "Bomfunk MC's" 'Sting' 'Melanie C' 'Aaliyah' 'Anastacia'
 'Alice Deejay' 'Dr. Dre' 'Linkin Park' 'Tom Jones' 'Sonique' 'M.O.P.'
 'Limp Bizkit' 'Darude' 'Da Brat' 'Moloko' 'Chicane' 'DMX'
 'Debelah Morgan' 'Madonna' 'Ruff Endz' 'Montell Jordan' 'Kylie Minogue'
 'JAY-Z' 'LeAnn Rimes' 'Avant' 'Enrique Iglesias' 'Toni Braxton' 'Bow Wow'
 'Missy Elliott' 'Backstreet Boys' 'Samantha Mumba' 'Mýa' 'Mary Mary'
 'Next' 'Janet Jackson' 'Ricky Martin' 'Jagged Edge' 'Mariah Carey'
 'Baha Men' 'Donell Jones' 'Oasis' 'DJ Ötzi' 'P!nk' 'Craig David'
 'Christina Aguilera' 'Red Hot Chili Peppers' 'Sammie' 'Santana' 'Kandi'
 'Vengaboys' 'Ronan Keating' 'Madison Avenue' 'Céline Dion' '3 Doors Down'
 'Carl Thomas' 'Mystikal' 'Fuel' 'Savage Garden' 'Westlife' 'All Saints'
 'Erykah Badu' 'Marc Anthony' 'Matchbox Twenty' 'Gabrielle' 'Creed'


There seems to be no issue with the data types in the data frame: the type of each column matches the kind of values that column holds.
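
The third wrangling step listed above, outlier detection and removal, is not shown in the cells so far. A minimal sketch of one common approach, the 1.5 × IQR rule applied to `duration_ms`, is given below. The choice of this rule and of this column is our assumption, not a method the notebook has settled on, and the `sample` values are made-up placeholders.

```python
import pandas as pd

# Hypothetical toy durations in milliseconds; the last value is deliberately extreme.
sample = pd.DataFrame(
    {"duration_ms": [200000, 210000, 220000, 230000, 240000, 600000]}
)

# Interquartile range of the column.
q1 = sample["duration_ms"].quantile(0.25)
q3 = sample["duration_ms"].quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR outside the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = ~sample["duration_ms"].between(lower, upper)

# Keep only the rows that fall inside the fences.
without_outliers = sample[~is_outlier]
print(f"Flagged {is_outlier.sum()} outlier(s); {len(without_outliers)} rows kept")
```

Whether flagged songs should actually be removed is a judgment call: an unusually long or short top hit is still a real observation, so it may be preferable to inspect the flagged rows before dropping them.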

## **Initial Insights & Hypothesis**

## **Modeling/Analysis Strategy** 

## **Project Goals**

## **Problems Faced**