# Analysis of the Most Streamed Spotify Songs in 2023



## 1. Business Understanding

This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan.

## 2. Data Understanding

This phase involves initial data collection and familiarization, including data cleaning, transformation, and exploration to identify quality issues and insights about the data.

In [1]:
# Importing the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go

from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.cluster import KMeans
from statsmodels.formula.api import ols

import folium as fl
import time

# Ignore warnings in the output
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load dataset
file_path = "https://raw.githubusercontent.com/diogo-costa-silva/assets/main/data/spotify-2023.csv"
df = pd.read_csv(file_path, encoding='ISO-8859-1')

# Creating a copy of the dataframe for cleaning
df_cleaned = df.copy()

In [3]:
df

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,streams,in_apple_playlists,...,bpm,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
0,Seven (feat. Latto) (Explicit Ver.),"Latto, Jung Kook",2,2023,7,14,553,147,141381703,43,...,125,B,Major,80,89,83,31,0,8,4
1,LALA,Myke Towers,1,2023,3,23,1474,48,133716286,48,...,92,C#,Major,71,61,74,7,0,10,4
2,vampire,Olivia Rodrigo,1,2023,6,30,1397,113,140003974,94,...,138,F,Major,51,32,53,17,0,31,6
3,Cruel Summer,Taylor Swift,1,2019,8,23,7858,100,800840817,116,...,170,A,Major,55,58,72,11,0,11,15
4,WHERE SHE GOES,Bad Bunny,1,2023,5,18,3133,50,303236322,84,...,144,A,Minor,65,23,80,14,63,11,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
948,My Mind & Me,Selena Gomez,1,2022,11,3,953,0,91473363,61,...,144,A,Major,60,24,39,57,0,8,3
949,Bigger Than The Whole Sky,Taylor Swift,1,2022,10,21,1180,0,121871870,4,...,166,F#,Major,42,7,24,83,1,12,6
950,A Veces (feat. Feid),"Feid, Paulo Londra",2,2022,11,3,573,0,73513683,2,...,92,C#,Major,80,81,67,4,0,8,6
951,En La De Ella,"Feid, Sech, Jhayco",3,2022,10,20,1320,0,133895612,29,...,97,C#,Major,82,67,77,8,0,12,5


The dataframe contains various features related to songs, artists, and their attributes or performance metrics across different platforms.

To make each feature easier to interpret, here is a quick overview of the dataset columns, based on the first few rows:

- track_name: The title of the tracks.
- artist(s)_name: Names of the artist(s) associated with each track.
- artist_count: The number of artists contributing to each track.
- released_year, released_month, released_day: The release date components for each track.
- Various metrics representing the track's presence and popularity on different music streaming platforms: in_spotify_playlists, in_spotify_charts, streams, in_apple_playlists, in_apple_charts, in_deezer_playlists, in_deezer_charts, in_shazam_charts
- bpm: The tempo of the track, measured in beats per minute.
- key: The key in which the track is composed.
- mode: The mode of the track (major or minor).
- Various metrics representing the track's musical qualities, including danceability_%, valence_%, energy_%, acousticness_%, instrumentalness_%, liveness_%, speechiness_%.



In [4]:
df.shape

(953, 24)

In [5]:
df.columns

Index(['track_name', 'artist(s)_name', 'artist_count', 'released_year',
       'released_month', 'released_day', 'in_spotify_playlists',
       'in_spotify_charts', 'streams', 'in_apple_playlists', 'in_apple_charts',
       'in_deezer_playlists', 'in_deezer_charts', 'in_shazam_charts', 'bpm',
       'key', 'mode', 'danceability_%', 'valence_%', 'energy_%',
       'acousticness_%', 'instrumentalness_%', 'liveness_%', 'speechiness_%'],
      dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 953 entries, 0 to 952
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   track_name            953 non-null    object
 1   artist(s)_name        953 non-null    object
 2   artist_count          953 non-null    int64 
 3   released_year         953 non-null    int64 
 4   released_month        953 non-null    int64 
 5   released_day          953 non-null    int64 
 6   in_spotify_playlists  953 non-null    int64 
 7   in_spotify_charts     953 non-null    int64 
 8   streams               953 non-null    object
 9   in_apple_playlists    953 non-null    int64 
 10  in_apple_charts       953 non-null    int64 
 11  in_deezer_playlists   953 non-null    object
 12  in_deezer_charts      953 non-null    int64 
 13  in_shazam_charts      903 non-null    object
 14  bpm                   953 non-null    int64 
 15  key                   858 non-null    ob

### Explore unique feature values

In [7]:
df['released_year'].unique()

array([2023, 2019, 2022, 2013, 2014, 2018, 2017, 2020, 2016, 2012, 1999,
       2008, 1975, 2021, 2015, 2011, 2004, 1985, 2007, 2002, 2010, 1983,
       1992, 1968, 1984, 2000, 1997, 1995, 2003, 1973, 1930, 1994, 1958,
       1957, 1963, 1959, 1970, 1971, 1952, 1946, 1979, 1950, 1942, 1986,
       2005, 1991, 1996, 1998, 1982, 1987])

In [8]:
df['released_month'].unique()

array([ 7,  3,  6,  8,  5,  4,  1, 12,  2, 10, 11,  9])

In [9]:
df['released_day'].unique()

array([14, 23, 30, 18,  1, 16,  7, 15, 17, 12, 31,  8, 24, 13, 22,  2, 25,
       29, 28, 21, 19, 10,  9, 26, 27,  6,  4,  3, 20,  5, 11])

In [10]:
df['streams'].unique()

array(['141381703', '133716286', '140003974', '800840817', '303236322',
       '183706234', '725980112', '58149378', '95217315', '553634067',
       '505671438', '58255150', '1316855716', '387570742', '2513188493',
       '1163093654', '496795686', '30546883', '335222234', '363369738',
       '86444842', '52135248', '1297026226', '200647221', '115364561',
       '78300654', '899183384', '61245289', '429829812', '127408954',
       '22581161', '52294266', '843957510', '999748277', '618990393',
       '123122413', '188933502', '1355959075', '786181836', '176553476',
       '354495408', '2808096550', '1109433169', '1047101291', '65156199',
       '570515054', '1085685420', '1647990401', '2565529693', '518745108',
       '107753850', '177740666', '153372011', '57876440', '1813673666',
       '3703895074', '256483385', '1214083358', '16011326', '812019557',
       '111947664', '156338624', '720434240', '357925728', '674072710',
       '1755214421', '404562836', '373199958', '14780425', '395

Among the unique values there is a peculiar one: 'BPM110KeyAModeMajorDanceability53Valence75Energy69Acousticness7Instrumentalness0Liveness17Speechiness3'. It does not follow the numerical pattern we would expect from a column that is supposed to represent streaming counts.

In [11]:
df['in_deezer_playlists'].unique()

array(['45', '58', '91', '125', '87', '88', '43', '30', '48', '66', '54',
       '21', '745', '182', '863', '161', '78', '95', '23', '10', '42',
       '582', '32', '318', '41', '15', '143', '50', '13', '245', '165',
       '184', '34', '24', '410', '151', '6', '843', '537', '247', '65',
       '138', '458', '2,445', '74', '57', '213', '109', '3,394', '3,421',
       '39', '142', '73', '102', '4', '89', '4,053', '169', '31', '8',
       '707', '1,056', '164', '4,095', '68', '331', '80', '18', '1,003',
       '71', '25', '5', '798', '110', '1,800', '141', '2,703', '35', '29',
       '0', '69', '63', '1,632', '163', '19', '59', '2,394', '1,034',
       '327', '2,163', '695', '2,655', '476', '145', '47', '61', '246',
       '38', '52', '6,551', '1,212', '1,078', '7', '282', '254', '588',
       '1', '2,094', '2,969', '26', '3,889', '99', '5,239', '44', '3',
       '974', '356', '12', '453', '3,631', '113', '112', '435', '929',
       '939', '4,607', '806', '885', '28', '2,733', '3,425', '

In [12]:
df['in_shazam_charts'].unique()

array(['826', '382', '949', '548', '425', '946', '418', '194', '953',
       '339', '251', '168', '1,021', '1,281', nan, '187', '0', '1,173',
       '29', '150', '73', '139', '1,093', '96', '211', '325', '294',
       '197', '27', '310', '354', '184', '212', '81', '82', '100', '62',
       '69', '727', '311', '1,133', '102', '332', '259', '140', '16',
       '110', '810', '176', '615', '210', '216', '215', '167', '37',
       '171', '272', '529', '26', '5', '169', '230', '84', '154', '93',
       '115', '72', '8', '323', '49', '1,451', '1,170', '429', '162',
       '10', '478', '236', '200', '78', '266', '486', '204', '34', '202',
       '312', '32', '153', '519', '458', '48', '666', '14', '925', '88',
       '203', '44', '74', '638', '64', '71', '2', '3', '136', '148', '22',
       '368', '1', '189', '52', '9', '31', '66', '208', '28', '558',
       '195', '13', '60', '503', '56', '15', '454', '40', '285', '129',
       '58', '117', '47', '20', '30', '80', '263', '116', '57', '39',
  

In [13]:
df['bpm'].unique()

array([125,  92, 138, 170, 144, 141, 148, 100, 130,  83, 150, 118, 174,
        89, 120,  78, 140, 123, 135, 133,  99, 107, 122, 204, 110, 126,
       168,  98,  97, 180,  96,  95,  90, 128,  79, 134, 186,  67, 106,
       171, 137, 101, 173, 198,  82,  81,  94, 124, 132, 131, 102, 142,
       116, 129, 172, 136,  88, 143, 112,  93, 206,  84, 158, 117, 114,
       108, 121, 127, 139, 162, 146, 115, 119,  80, 160, 192, 163, 154,
       104, 164, 145,  85, 166, 109, 157,  74, 105, 155, 149, 169,  91,
       202, 153, 178, 176, 111, 182, 175,  87,  76, 113,  77, 177, 147,
        75, 103, 151, 152,  65, 179,  86,  73, 181, 161,  72, 184,  71,
       189, 200, 196, 188, 156, 183, 165])

In [14]:
df['mode'].unique()

array(['Major', 'Minor'], dtype=object)

In [15]:
df['danceability_%'].unique()

array([80, 71, 51, 55, 65, 92, 67, 85, 81, 57, 78, 52, 64, 44, 86, 63, 69,
       48, 79, 74, 56, 72, 61, 75, 60, 76, 77, 59, 68, 53, 45, 50, 84, 70,
       88, 90, 43, 62, 49, 58, 34, 91, 82, 83, 54, 87, 35, 42, 93, 47, 73,
       66, 33, 37, 89, 95, 94, 32, 40, 36, 25, 41, 46, 39, 24, 23, 27, 28,
       31, 29, 96, 38])

In [16]:
df['liveness_%'].unique()

array([ 8, 10, 31, 11, 28, 27, 15,  3,  9, 16, 34, 12, 36, 42,  6, 14, 56,
       33, 19, 13,  7, 35, 23, 44, 17, 22, 25, 48, 43, 30, 20, 83, 38, 21,
       26, 29, 18, 32, 53,  5, 40, 50, 64, 37, 41, 45, 58, 91, 80,  4, 47,
       39, 61, 92, 52, 72, 46, 77, 66, 24, 60, 49, 97, 90, 67, 51, 63, 54])

In [17]:
df['instrumentalness_%'].unique()

array([ 0, 63, 17,  2, 19,  1, 18,  3, 51,  8,  9,  4,  5, 25, 46, 10, 90,
       47, 35, 12, 13, 41, 24, 23,  6, 20, 30, 15, 91, 27, 72, 42, 14, 44,
       11, 61, 83, 22, 33])

In [18]:
df['speechiness_%'].unique()

array([ 4,  6, 15, 24,  3,  9, 33,  5,  7, 16, 20, 28, 10, 25, 19, 14, 29,
        8, 13, 17, 34,  2, 11, 22, 12, 49, 21, 23, 64, 30, 39, 36, 42, 26,
       32, 35, 31, 38, 27, 46, 18, 37, 40, 41, 44, 43, 45, 59])

### 2.1. Data Pre-processing

Before we dive into individual data cleaning tasks, let's summarize the initial steps we need to undertake for Data Cleaning and Transformation:

1. Transforming Date Features: The 'released_year', 'released_month', and 'released_day' fields are currently separate and in integer format. We need to combine these into a single datetime object to allow more efficient temporal analysis.
2. Cleaning Specific Fields: The 'streams' field appears to have an inconsistent entry which we'll need to investigate and clean. The 'in_deezer_playlists' and 'in_shazam_charts' fields contain numbers with commas, which should be standard integers. We'll convert these.
3. Reviewing Categorical Variables: The 'key' and 'mode' fields are non-numeric and could be considered categorical. We'll review these to decide on the most appropriate treatment, potentially converting them into a category type for efficient processing.
4. Extended Data Exploration: Once the data is cleaned, we will perform an extensive exploratory data analysis (EDA) to uncover insights, patterns, and potential issues in the data. This EDA will involve statistical summaries, visualizations, and various other techniques to understand the data deeply.



In concrete terms, the cleaning tasks are:

1. Convert 'released_year', 'released_month', and 'released_day' into a single datetime object.
2. Clean the 'streams' column and convert its data type.
3. Remove commas from 'in_deezer_playlists' and 'in_shazam_charts' and convert them to a numeric type.
4. Discuss the potential conversion of 'key' and 'mode' into category types.
5. Handle NaN values in 'in_shazam_charts' and 'key'.

We'll start by addressing the first item on our data cleaning list: converting the 'released_year', 'released_month', and 'released_day' columns into a single datetime column. This transformation is important because it allows for more efficient handling of the data, particularly for operations that involve date calculations, filtering, and aggregation.

In [19]:
# Step 1: Ensure the year, month, and day columns are integers (they should already be)
for col in ['released_year', 'released_month', 'released_day']:
    df_cleaned[col] = df_cleaned[col].astype(int)

# Step 2: Combine the year, month, and day into a single column (as a string)
df_cleaned['release_date'] = df_cleaned['released_year'].astype(str) + '-' + \
                             df_cleaned['released_month'].astype(str).str.zfill(2) + '-' + \
                             df_cleaned['released_day'].astype(str).str.zfill(2)  # zfill ensures a format like 2023-07-14

# Step 3: Convert the 'release_date' column to a datetime object
df_cleaned['release_date'] = pd.to_datetime(df_cleaned['release_date'], format='%Y-%m-%d')

# Step 4: (Optional) Drop the original 'released_year', 'released_month', and 'released_day' columns
# We will retain these columns for now, as they might be useful for analysis later on.

# Display the first few rows of the cleaned dataframe to verify our changes
df_cleaned[['released_year', 'released_month', 'released_day', 'release_date']].head()


Unnamed: 0,released_year,released_month,released_day,release_date
0,2023,7,14,2023-07-14
1,2023,3,23,2023-03-23
2,2023,6,30,2023-06-30
3,2019,8,23,2019-08-23
4,2023,5,18,2023-05-18


The 'released_year', 'released_month', and 'released_day' columns have been successfully combined into a new 'release_date' column, with the date represented as a datetime object. This format is more suitable for any subsequent operations that involve date calculations or aggregations.

Here's a brief overview of what we did:

- We confirmed that the 'released_year', 'released_month', and 'released_day' columns were indeed integers.
- We concatenated these columns into a single 'release_date' column, ensuring proper zero-padding for single-digit months and days to maintain the 'YYYY-MM-DD' format.
- We converted the 'release_date' column from a string to a datetime object using pandas' to_datetime method, providing the specific format to ensure accurate parsing.
- We opted to retain the original year, month, and day columns, considering they might be useful for certain types of analysis where these individual components are needed.

With this transformation, operations that require date functionality will be much more straightforward, such as sorting by date, calculating time spans, and aggregating based on specific time periods (e.g., monthly or yearly).
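
As an alternative sketch, `pd.to_datetime` can also assemble dates directly from a frame whose columns are named `year`, `month`, and `day`, which avoids the string concatenation. The `demo` frame below is made up for illustration and is not part of the dataset. Passing `errors="coerce"` turns an impossible date into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample that mirrors the three release columns
demo = pd.DataFrame({
    "released_year": [2023, 2019, 2022],
    "released_month": [7, 8, 2],
    "released_day": [14, 23, 31],  # 2022-02-31 is not a valid date
})

# pd.to_datetime accepts a frame with 'year', 'month', and 'day' columns
parts = demo.rename(columns={
    "released_year": "year",
    "released_month": "month",
    "released_day": "day",
})
demo["release_date"] = pd.to_datetime(parts, errors="coerce")

# Date-based aggregation: number of valid releases per year
releases_per_year = demo.groupby(demo["release_date"].dt.year).size()
```

Coercing is a design choice: it keeps the conversion from failing on a single bad row, at the cost of having to check for `NaT` values afterwards.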

Next, we will handle the 'streams' column, which appears to have an inconsistent entry. We'll need to clean this data and convert the column to the appropriate data type.

Here are the steps we'll take to clean this column:

1. Investigate the unusual entry in the 'streams' column and determine whether it's an isolated case or part of a broader issue.
2. If it's an isolated incident, we may directly correct this entry. If it's a systematic issue, we'll need a more general approach.
3. Convert the 'streams' column to the appropriate numerical type (integer) after handling the anomalous entries.
4. Consider handling any other potential outliers or anomalies that could affect the analysis.

Let's begin by examining the problematic value and understanding the extent of this issue within the column.

In [20]:
# Step 1: Investigate the unusual entry in the 'streams' column.
problematic_entries = df_cleaned[df_cleaned['streams'].astype(str).str.contains(r'[a-zA-Z]')]

# Return the problematic entries for examination
problematic_entries[['track_name', 'streams']]


Unnamed: 0,track_name,streams
574,Love Grows (Where My Rosemary Goes),BPM110KeyAModeMajorDanceability53Valence75Energy69Acousticness7Instrumentalness0Liveness17Speechiness3


The investigation reveals that there is indeed an anomalous entry within the 'streams' column. Specifically, the track "Love Grows (Where My Rosemary Goes)" has a string of non-numeric characters, which appears to be a concatenation of various song attributes rather than the expected numerical value representing stream counts.

Given that this is an isolated incident, we have a few approaches to rectify this:

- Correction: If we know the correct number of streams or can retrieve it, we could directly replace the incorrect entry.
- Removal: We could remove this specific entry from our dataset to prevent it from skewing any analytical results.
- Imputation: In the absence of the correct data, we might impute a value based on similar tracks or the average number of streams, though this approach can introduce bias.

To preserve the size and structure of the dataset, and in the absence of the actual value, imputation is a tempting option. However, the ideal approach would be to retrieve the correct figure if it is accessible.

In this scenario, without the actual streaming count, we'll opt to replace this entry with a 'NaN' so it won't interfere with numerical conversions and calculations. Later, we can decide if we want to impute an estimated value or handle the 'NaN' in another manner.

In [21]:
# Replace the problematic value with NaN (represented as np.nan in the dataset)
df_cleaned.loc[df_cleaned['streams'].astype(str).str.contains(r'[a-zA-Z]'), 'streams'] = np.nan

# Convert the 'streams' column to a float (since we introduced NaN, we can't convert to int directly)
df_cleaned['streams'] = df_cleaned['streams'].astype(float)

# Check if the conversion was successful and how the 'streams' column looks now
df_cleaned['streams'].describe()  # This will give us statistical info including count (to check if we have one less entry)

count    9.520000e+02
mean     5.141374e+08
std      5.668569e+08
min      2.762000e+03
25%      1.416362e+08
50%      2.905309e+08
75%      6.738690e+08
max      3.703895e+09
Name: streams, dtype: float64

The cleanup for the 'streams' column was successful. We replaced the anomalous entry with 'NaN' and converted the column to float type (since NumPy's integer dtypes cannot represent NaN values). The count of 952 in the summary confirms that exactly one entry is now missing.

This column is now clean for numerical operations and analyses, although we'll need to decide later how to handle the 'NaN' entry, whether by imputation, removal, or some other strategy.
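
A more general way to do the same cleanup, sketched here on a made-up three-value series rather than the real column, is `pd.to_numeric` with `errors="coerce"`, which turns every unparseable string into `NaN` in one pass. If integer semantics matter, pandas' nullable `Int64` dtype can hold whole numbers alongside missing values. The names `streams_raw`, `streams_num`, and `streams_int` are illustrative only.

```python
import pandas as pd

# Made-up sample: two valid counts and one corrupted entry
streams_raw = pd.Series(["141381703", "BPM110KeyAModeMajor", "800840817"])

# Any value that cannot be parsed as a number becomes NaN
streams_num = pd.to_numeric(streams_raw, errors="coerce")

# Optional: the nullable integer dtype keeps counts as integers
# and represents the gap as <NA>
streams_int = streams_num.astype("Int64")
```

Compared with matching letters via a regular expression, coercion also catches other malformed values (for example, empty strings or stray symbols), so it is worth inspecting which rows became missing.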

Next, we'll address the 'in_deezer_playlists' and 'in_shazam_charts' columns, which contain numbers with commas and should be converted to integers.

Here's our plan of action:

1. Remove the commas from the 'in_deezer_playlists' and 'in_shazam_charts' columns to eliminate the thousands separators.
2. Convert these cleaned columns to a numeric type to enable numerical operations. We use float rather than integer because 'in_shazam_charts' contains NaN values.
3. Verify the success of these operations by checking the data types or performing statistical summaries.


In [22]:
# Step 1: Remove the thousands separators (commas) from the specified columns.
# Step 2: Convert to float rather than int, because 'in_shazam_charts' contains NaN values.
df_cleaned['in_deezer_playlists'] = df_cleaned['in_deezer_playlists'].str.replace(',', '', regex=False).astype(float)
df_cleaned['in_shazam_charts'] = df_cleaned['in_shazam_charts'].str.replace(',', '', regex=False).astype(float)

# Step 3: Verify the operations by checking the data types of these columns.
data_types = df_cleaned[['in_deezer_playlists', 'in_shazam_charts']].dtypes

# Display the data types of the cleaned columns
data_types

in_deezer_playlists    float64
in_shazam_charts       float64
dtype: object

In [28]:
#df_cleaned['in_shazam_charts'].unique()

The cleaning for the 'in_deezer_playlists' and 'in_shazam_charts' columns was successful. We removed the commas and converted the values to the float data type. We used float instead of integer due to the presence of 'NaN' values, which are not compatible with the integer type.

These columns are now prepared for numerical analysis and mathematical operations, keeping in mind that we have 'NaN' values that we might need to address later, depending on the specific requirements of our subsequent analysis.
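
As a possible shortcut, `pd.read_csv` has a `thousands` parameter that strips the separator while loading. The sketch below uses a tiny made-up CSV (`csv_text`) instead of the real file; whether this parses every row of the actual dataset cleanly has not been checked here.

```python
import io
import pandas as pd

# Made-up CSV with quoted, comma-separated thousands and one empty value
csv_text = (
    "track,in_deezer_playlists,in_shazam_charts\n"
    'A,"2,445","1,021"\n'
    "B,45,\n"
    'C,"1,056",0\n'
)

# thousands="," tells the parser to ignore commas inside numeric fields
demo = pd.read_csv(io.StringIO(csv_text), thousands=",")

# Columns without gaps become integers; columns with gaps become floats
column_types = demo.dtypes
```

Handling the separator at load time keeps the cleaning code shorter, while the explicit `str.replace` approach makes the transformation visible in the notebook.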

Next, we should discuss the potential categorization of the 'key' and 'mode' columns and then handle the NaN values in the 'in_shazam_charts' and 'key' columns.

The 'key' and 'mode' columns in the dataset represent categorical data, indicating the musical key and mode (major or minor) of each track. While these are stored as strings (object type), converting them to the category data type can be beneficial for several reasons:

- Efficiency: Category data type often uses less memory and can speed up operations like sorting and comparison.
- Integrity: It restricts the data to a specific set of values, ensuring consistency.
- Usefulness for Analysis: Categorical data is handy for statistical methods that are designed to handle categories rather than numerical data, and it's essential for certain visualizations and groupings.

Here's our plan for this part:

1. Examine the unique values in 'key' and 'mode' to understand the range of categories we're dealing with.
2. Convert 'key' and 'mode' to the category data type.
3. Validate the conversion.
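
The three steps of this plan can be sketched as follows. The `demo` frame is a made-up stand-in for `df_cleaned`, with one missing key to show that `NaN` survives the conversion as a missing value rather than becoming a category.

```python
import pandas as pd

# Hypothetical sample with one missing key
demo = pd.DataFrame({
    "key": ["B", "C#", None, "A", "C#"],
    "mode": ["Major", "Major", "Minor", "Minor", "Major"],
})

# 1. Examine the unique values (missing values included)
key_values = demo["key"].unique()
mode_values = demo["mode"].unique()

# 2. Convert both columns to the category dtype
demo["key"] = demo["key"].astype("category")
demo["mode"] = demo["mode"].astype("category")

# 3. Validate the conversion
column_types = demo[["key", "mode"]].dtypes
key_categories = list(demo["key"].cat.categories)
```

On the real data the same calls would be applied to `df_cleaned['key']` and `df_cleaned['mode']`.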

### Basic descriptive statistics and general data checks

We'll check for any missing or duplicate values and understand the data types and summary statistics of each column. This step is crucial for deciding how to handle preprocessing in the Data Preparation phase.

In [24]:
# Descriptive statistics for numerical columns
desc_stats = df.describe()
desc_stats

Unnamed: 0,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,in_apple_playlists,in_apple_charts,in_deezer_charts,bpm,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
count,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0
mean,1.556139,2018.238195,6.033578,13.930745,5200.124869,12.009444,67.812172,51.908709,2.666317,122.540399,66.96957,51.43127,64.279119,27.057712,1.581322,18.213012,10.131165
std,0.893044,11.116218,3.566435,9.201949,7897.60899,19.575992,86.441493,50.630241,6.035599,28.057802,14.63061,23.480632,16.550526,25.996077,8.4098,13.711223,9.912888
min,1.0,1930.0,1.0,1.0,31.0,0.0,0.0,0.0,0.0,65.0,23.0,4.0,9.0,0.0,0.0,3.0,2.0
25%,1.0,2020.0,3.0,6.0,875.0,0.0,13.0,7.0,0.0,100.0,57.0,32.0,53.0,6.0,0.0,10.0,4.0
50%,1.0,2022.0,6.0,13.0,2224.0,3.0,34.0,38.0,0.0,121.0,69.0,51.0,66.0,18.0,0.0,12.0,6.0
75%,2.0,2022.0,9.0,22.0,5542.0,16.0,88.0,87.0,2.0,140.0,78.0,70.0,77.0,43.0,0.0,24.0,11.0
max,8.0,2023.0,12.0,31.0,52898.0,147.0,672.0,275.0,58.0,206.0,96.0,97.0,97.0,97.0,91.0,97.0,64.0

Note that these statistics are computed on the raw `df`, where 'streams', 'in_deezer_playlists', and 'in_shazam_charts' are still stored as text, so those three columns are absent from the summary. Running `describe()` on `df_cleaned` would include them.


In [25]:
# Checking for missing values
missing_values = df.isnull().sum()
missing_values

track_name               0
artist(s)_name           0
artist_count             0
released_year            0
released_month           0
released_day             0
in_spotify_playlists     0
in_spotify_charts        0
streams                  0
in_apple_playlists       0
in_apple_charts          0
in_deezer_playlists      0
in_deezer_charts         0
in_shazam_charts        50
bpm                      0
key                     95
mode                     0
danceability_%           0
valence_%                0
energy_%                 0
acousticness_%           0
instrumentalness_%       0
liveness_%               0
speechiness_%            0
dtype: int64
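
The counts above show 50 missing values in 'in_shazam_charts' and 95 in 'key'. One possible treatment, which is an assumption for illustration and not a decision taken in this analysis, is to label missing keys explicitly and to flag missing Shazam values instead of guessing them. The `demo` frame and the `shazam_missing` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample with one gap in each column
demo = pd.DataFrame({
    "in_shazam_charts": [826.0, np.nan, 0.0],
    "key": ["B", None, "C#"],
})

# Share of missing values per column, useful for judging the impact
missing_share = demo.isnull().mean()

# Assumption: a missing key is treated as its own "Unknown" label
demo["key"] = demo["key"].fillna("Unknown")

# Assumption: keep the Shazam gap as NaN, but record it in an indicator column
demo["shazam_missing"] = demo["in_shazam_charts"].isna()
```

Filling 'in_shazam_charts' with 0 would be another option, but it would conflate "not charted" with "not recorded", so the choice depends on what a missing value means in the source data.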

In [26]:
# Checking for duplicates
num_duplicates = df.duplicated().sum()
num_duplicates

0
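
`df.duplicated()` only flags rows that are identical in every column. A stricter check, sketched below on made-up data, looks for repeated track and artist combinations whose metrics differ. Whether such near-duplicates exist in this dataset is not established here.

```python
import pandas as pd

# Hypothetical sample: the same song listed twice with different stream counts
demo = pd.DataFrame({
    "track_name": ["Song A", "Song A", "Song B"],
    "artist(s)_name": ["Artist X", "Artist X", "Artist Y"],
    "streams": [100, 120, 300],
})

# Full-row comparison finds nothing, because the stream counts differ
full_row_dupes = demo.duplicated().sum()

# Comparing only title and artist reveals the repeated song;
# keep=False marks every occurrence, not just the later ones
same_song = demo.duplicated(subset=["track_name", "artist(s)_name"], keep=False)
candidates = demo[same_song]
```

Rows found this way need manual review, since different versions of a song can legitimately share a title and artist.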

In [27]:
df_cleaned.dtypes

track_name                      object
artist(s)_name                  object
artist_count                     int64
released_year                    int64
released_month                   int64
released_day                     int64
in_spotify_playlists             int64
in_spotify_charts                int64
streams                        float64
in_apple_playlists               int64
in_apple_charts                  int64
in_deezer_playlists            float64
in_deezer_charts                 int64
in_shazam_charts               float64
bpm                              int64
key                             object
mode                            object
danceability_%                   int64
valence_%                        int64
energy_%                         int64
acousticness_%                   int64
instrumentalness_%               int64
liveness_%                       int64
speechiness_%                    int64
release_date            datetime64[ns]
dtype: object

## 3. Data Preparation

This stage often consumes the most time in data science projects. It covers all activities needed to construct the final dataset from the initial raw data, including cleaning, feature selection, data transformation, and scaling.

## 4. Modeling

Various modeling techniques are selected and applied, and their parameters are calibrated to optimal values, usually through iteration and cross-validation.


## 5. Evaluation

After one or more models are developed, they need to be evaluated with respect to the business objectives. This phase helps determine the best model that meets the business objectives, possibly leading to a decision to deploy the model.


## 6. Deployment

The knowledge gained needs to be organized and presented in a way that the customer can use. This involves deploying the chosen model into a real-world scenario to support decision-making.