<a href="https://colab.research.google.com/github/baut-jc/DDDS-My-Projects/blob/main/Project-4/Project_4_Spotify.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 4: Spotify

## Problem Definition


State the business problem, then translate it into a data science problem: say whether it is supervised or unsupervised, and whether it is a classification, regression, or clustering task.

**Business Problem:** A record label or artist wants to forecast the potential popularity of a new song *before* its release to make informed marketing and strategic decisions.

**Data Science Problem:** This is a **supervised regression** task.

It's **supervised** because we have a dataset with features and a known target variable (*"Popularity"*).

It's **regression** because "Popularity" is a numerical score that we treat as a continuous value, not a category.

**Primary Goal:** Minimize the Root Mean Squared Error (RMSE) when predicting the "Popularity" score.
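
As a reference for the metric named above, RMSE is the square root of the mean squared difference between actual and predicted scores. The following is a minimal sketch with made-up popularity values; the `rmse` helper and the numbers are illustrative assumptions, not part of the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up popularity scores, for illustration only.
actual = [70, 85, 60, 90]
predicted = [72, 80, 65, 88]

score = rmse(actual, predicted)
print(round(score, 3))
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.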

**Secondary Goal:** Identify which audio features are most influential in predicting popularity.
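
One possible way to approach this goal is the impurity-based `feature_importances_` attribute of a fitted `RandomForestRegressor`, which this notebook imports. The sketch below uses synthetic stand-in data in which the target is deliberately built from `Energy`, so the expected ranking is known in advance. The column values and model settings are assumptions, not results from the Spotify data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in features (NOT the real Spotify values).
X = pd.DataFrame({
    "Danceability": rng.uniform(0, 1, n),
    "Energy": rng.uniform(0, 1, n),
    "Valence": rng.uniform(0, 1, n),
})
# The synthetic target depends only on Energy plus a little noise.
y = 40 + 50 * X["Energy"] + rng.normal(0, 1, n)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Rank features by importance; the values sum to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a first indication rather than a final ranking.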

## Data Collection/Sources


#### Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn import metrics
import pickle

In [2]:
url = "https://ddc-datascience.s3.amazonaws.com/Projects/Project.4-Spotify/Data/Spotify.csv"
!curl -s -I {url}

HTTP/1.1 200 OK
x-amz-id-2: OWUXPGb+LtIn2SHsVjCSpFzcdvYe33opUfdtKKKwfNKcJJoDP354CtzlZZIcyuTyAiWHlEpXlkp26TvNRsCfNGSYbtQCgFp9n6uYOTDRpkI=
x-amz-request-id: KYVPBKDD7192A34Q
Date: Wed, 02 Jul 2025 16:58:20 GMT
Last-Modified: Wed, 04 Oct 2023 17:23:56 GMT
ETag: "65b9875b11e0d7ea03ee2af024f45e99"
x-amz-server-side-encryption: AES256
Accept-Ranges: bytes
Content-Type: text/csv
Content-Length: 738124
Server: AmazonS3



In [3]:
!curl -s -O {url}

In [4]:
ls -la

total 740
drwxr-xr-x 1 root root   4096 Jul  2 16:58 ./
drwxr-xr-x 1 root root   4096 Jul  2 16:57 ../
drwxr-xr-x 4 root root   4096 Jul  1 21:04 .config/
drwxr-xr-x 1 root root   4096 Jul  1 21:04 sample_data/
-rw-r--r-- 1 root root 738124 Jul  2 16:58 Spotify.csv


In [5]:
!head -1 Spotify.csv | tr , '\n' | cat -n

     1	Index
     2	Highest Charting Position
     3	Number of Times Charted
     4	Week of Highest Charting
     5	Song Name
     6	Streams
     7	Artist
     8	Artist Followers
     9	Song ID
    10	Genre
    11	Release Date
    12	Weeks Charted
    13	Popularity
    14	Danceability
    15	Energy
    16	Loudness
    17	Speechiness
    18	Acousticness
    19	Liveness
    20	Tempo
    21	Duration (ms)
    22	Valence
    23	Chord
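
The same numbered column listing can also be produced from Python, which avoids depending on shell tools. This is a minimal sketch that reads only the header row with `nrows=0`; the in-memory `sample_csv` is a shortened stand-in for the real file and holds just four of the 23 column names.

```python
import io
import pandas as pd

# Stand-in for the header line of Spotify.csv (only a few of the 23 columns).
sample_csv = io.StringIO("Index,Song Name,Streams,Popularity\n")

# nrows=0 parses the header only, so no data rows are loaded.
header = pd.read_csv(sample_csv, nrows=0)
columns = list(header.columns)

# Print a 1-based position next to each column name, like `cat -n`.
for position, name in enumerate(columns, start=1):
    print(f"{position:>6}\t{name}")
```

With the downloaded file, `pd.read_csv("Spotify.csv", nrows=0)` would take the place of the stand-in.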


#### S.H.I.D (shape, head, info, describe)

In [6]:
df = pd.read_csv( url )
df.shape

(1556, 23)

In [7]:
df.head()

| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48633449 | Måneskin | 3377762 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.714 | 0.8 | -4.808 | 0.0504 | 0.127 | 0.359 | 134.002 | 211560 | 0.589 | B |
| 1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47248719 | The Kid LAROI | 2230022 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.591 | 0.764 | -5.484 | 0.0483 | 0.0383 | 0.103 | 169.928 | 141806 | 0.478 | C#/Db |
| 2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40162559 | Olivia Rodrigo | 6266514 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.563 | 0.664 | -5.044 | 0.154 | 0.335 | 0.0849 | 166.928 | 178147 | 0.688 | A |
| 3 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37799456 | Ed Sheeran | 83293380 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 0.808 | 0.897 | -3.712 | 0.0348 | 0.0469 | 0.364 | 126.026 | 231041 | 0.591 | B |
| 4 | 5 | 5 | 1 | 2021-07-23--2021-07-30 | INDUSTRY BABY (feat. Jack Harlow) | 33948454 | Lil Nas X | 5473565 | 27NovPIUIRrOZoCHxABJwK | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.736 | 0.704 | -7.409 | 0.0615 | 0.0203 | 0.0501 | 149.995 | 212000 | 0.894 | D#/Eb |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1556 entries, 0 to 1555
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Index                      1556 non-null   int64 
 1   Highest Charting Position  1556 non-null   int64 
 2   Number of Times Charted    1556 non-null   int64 
 3   Week of Highest Charting   1556 non-null   object
 4   Song Name                  1556 non-null   object
 5   Streams                    1556 non-null   object
 6   Artist                     1556 non-null   object
 7   Artist Followers           1556 non-null   object
 8   Song ID                    1556 non-null   object
 9   Genre                      1556 non-null   object
 10  Release Date               1556 non-null   object
 11  Weeks Charted              1556 non-null   object
 12  Popularity                 1556 non-null   object
 13  Danceability               1556 non-null   object
 14  Energy  

In [9]:
df.describe().transpose().sort_values( by = ["mean"])

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Number of Times Charted | 1556.0 | 10.66838 | 16.360546 | 1.0 | 1.0 | 4.0 | 12.0 | 142.0 |
| Highest Charting Position | 1556.0 | 87.744216 | 58.147225 | 1.0 | 37.0 | 80.0 | 137.0 | 200.0 |
| Index | 1556.0 | 778.5 | 449.322824 | 1.0 | 389.75 | 778.5 | 1167.25 | 1556.0 |


## Data Cleaning


In [10]:
df.nunique().sort_values(ascending = False)

# Streams and Song Name are unique in every row; Streams is another candidate target.
# Open question: which of these features are known in advance and could have affected the target?

| Column | Unique values |
|---|---|
| Index | 1556 |
| Streams | 1556 |
| Song Name | 1556 |
| Song ID | 1517 |
| Duration (ms) | 1486 |
| Tempo | 1461 |
| Loudness | 1394 |
| Acousticness | 965 |
| Weeks Charted | 775 |
| Speechiness | 772 |


In [11]:
# Count missing values across all columns (work on a copy so df stays untouched)
df1 = df.copy()
df1.isna().sum().sum()

np.int64(0)
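
A count of zero only covers values pandas already treats as missing. Since `df.info()` reports columns such as `Streams`, `Popularity`, and `Danceability` as `object`, they may hold text that hides gaps, for example blank strings or thousands separators. The sketch below shows one way to coerce such columns with `pd.to_numeric(errors="coerce")` and then recount the nulls. The rows in `raw` and the `to_numeric_column` helper are hypothetical; the real file may be formatted differently.

```python
import pandas as pd

# Hypothetical rows that mimic object-typed numeric columns.
raw = pd.DataFrame({
    "Streams": ["48,633,449", "47,248,719", "40,162,559"],
    "Popularity": ["100", " ", "95"],
    "Energy": ["0.8", "0.764", " "],
})

def to_numeric_column(series):
    """Strip separators and whitespace, then turn unparseable text into NaN."""
    cleaned = series.astype(str).str.replace(",", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")

clean = raw.apply(to_numeric_column)

print(raw.isna().sum().sum())  # blank strings are not counted as missing
print(clean.isna().sum())      # after coercion the hidden gaps show up
print(clean.dtypes)
```

Rows that turn into `NaN` after coercion can then be dropped or imputed before modeling.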

## Exploratory Data Analysis


In [11]:
# Correlation plot against each candidate target, to compare which features are most effective for predicting it
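
The planned comparison could start from a ranked list of correlations with the target. This is a minimal sketch on synthetic data in which `Popularity` is deliberately built from `Danceability`; the values are assumptions for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Synthetic stand-in features (NOT the real Spotify values).
features = pd.DataFrame({
    "Danceability": rng.uniform(0, 1, n),
    "Energy": rng.uniform(0, 1, n),
    "Loudness": rng.uniform(-20, 0, n),
})
# The synthetic target is driven by Danceability plus noise.
features["Popularity"] = (
    30 + 40 * features["Danceability"] + rng.normal(0, 5, n)
)

# Pearson correlation of every feature with the target,
# sorted by absolute strength.
target_corr = (
    features.corr()["Popularity"]
    .drop("Popularity")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Calling `target_corr.plot.barh()` would turn the ranking into a bar chart, and repeating the step with `Streams` as the target column would allow a side-by-side comparison.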

## Processing



## Data Visualization/Communication of Results
