1️⃣ Introduction
Welcome! In this notebook, we’ll explore how Python and Machine Learning can help us predict how popular a song will be, expressed as a popularity score from 0 to 100.

We’ll:
✅ Learn Python basics
✅ Explore a dataset of songs
✅ Train a simple machine learning model to predict popularity

Let’s get started! 🎵🐍

2️⃣ Python Basics
Before diving into Machine Learning, let’s cover some Python essentials.

Python Syntax & Variables
Run this cell to see Python in action:

In [None]:
# Simple Python example
song_name = "Shake It Off"
artist = "Taylor Swift"
popularity = 95  # Popularity score (0-100)

print(f"{song_name} by {artist} has a popularity score of {popularity}.")


Shake It Off by Taylor Swift has a popularity score of 95.
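
Related values like these are often grouped into a single dictionary so they travel together. Here is a minimal sketch that reuses the values from the cell above; the `song` variable and its keys are illustrative names, not columns from the dataset.

```python
# Group the song's attributes in one dictionary (keys are illustrative)
song = {
    "name": "Shake It Off",
    "artist": "Taylor Swift",
    "popularity": 95,  # Popularity score (0-100)
}

# Look up values by key and reuse them in an f-string
summary = f"{song['name']} by {song['artist']} has a popularity score of {song['popularity']}."
print(summary)

# type() shows which kind of value each key holds
print(type(song["name"]).__name__, type(song["popularity"]).__name__)
```

The text values are strings (`str`) and the score is an integer (`int`), which is why the score can later be compared or averaged while the names cannot.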


Lists & Loops

In [None]:
# List of songs
songs = ["Shake It Off", "Blinding Lights", "Uptown Funk"]

# Loop through songs
for song in songs:
    print(f"Now playing: {song}")

Now playing: Shake It Off
Now playing: Blinding Lights
Now playing: Uptown Funk
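
Loops become more useful when combined with conditions. The sketch below pairs the same song titles with popularity scores and keeps only the high scorers; the scores for the second and third songs are made up for illustration and are not taken from the dataset.

```python
# Song titles from the cell above paired with popularity scores
songs = ["Shake It Off", "Blinding Lights", "Uptown Funk"]
scores = [95, 90, 85]  # illustrative values, not taken from the dataset

# enumerate() gives a running position alongside each item
for position, song in enumerate(songs, start=1):
    print(f"{position}. {song}")

# zip() pairs each song with its score; the comprehension keeps scores of 90 or more
top_songs = [song for song, score in zip(songs, scores) if score >= 90]
print(top_songs)
```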


3️⃣ Loading & Exploring the Music Dataset
We’ll use the Spotify Tracks dataset hosted on Hugging Face, which describes each track with audio features such as tempo, energy, and danceability, along with a popularity score.

Import Required Libraries & Load the Dataset

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Load dataset (reading an hf:// path with pandas requires the huggingface_hub package)
url = "hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv"
df = pd.read_csv(url)

# Show first few rows
df.head()



Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,...,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,...,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,...,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,...,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,...,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


Resource for dataset: https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset

Understanding the Data


In [None]:
# Check column names
print(df.columns)

# Select relevant features
features = ["danceability", "energy", "tempo", "valence"]
target = "popularity"

# Check data types
df[features + [target]].info()


Index(['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name',
       'popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
       'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
       'track_genre'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114000 entries, 0 to 113999
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   danceability  114000 non-null  float64
 1   energy        114000 non-null  float64
 2   tempo         114000 non-null  float64
 3   valence       114000 non-null  float64
 4   popularity    114000 non-null  int64  
dtypes: float64(4), int64(1)
memory usage: 4.3 MB
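
One quick way to judge whether a feature might help is its correlation with the target. The sketch below computes a Pearson correlation in plain Python using only the five rows shown by `df.head()` earlier. Five rows are far too few to draw conclusions, and `pearson` is a hypothetical helper name. On the full DataFrame, `df[features + [target]].corr()` performs the same kind of calculation for every pair of columns.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# danceability and popularity for the five rows shown by df.head()
head_danceability = [0.676, 0.42, 0.438, 0.266, 0.618]
head_popularity = [73, 55, 57, 71, 82]

r = pearson(head_danceability, head_popularity)
print(f"Correlation (5 rows only): {r:.2f}")
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 suggests the feature alone says little about popularity.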


4️⃣ Training a Machine Learning Model
Let’s train a model to predict song popularity based on danceability, energy, tempo, and valence.

Prepare Data for Training:

In [None]:
# Remove rows with missing values in the features or the target
df = df.dropna(subset=features + [target])

# Split data
X = df[features]  # Features (inputs)
y = df[target]  # Target (output)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training size: {len(X_train)}, Test size: {len(X_test)}")


Training size: 91200, Test size: 22800
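
`train_test_split` shuffles the rows and holds out the `test_size` fraction, which is why 114,000 rows become 91,200 for training and 22,800 for testing. Here is a rough plain-Python sketch of the idea. `simple_split` is a hypothetical helper, and scikit-learn's own shuffling will not produce the same row order.

```python
import random

def simple_split(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

row_ids = range(114000)  # stand-in for the dataset's row indices
train_rows, test_rows = simple_split(row_ids)
print(f"Training size: {len(train_rows)}, Test size: {len(test_rows)}")
```

Keeping the test rows completely separate from the training rows is what lets the later error score reflect performance on songs the model has never seen.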


Train the Model:

In [None]:
# Initialize and train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")


Mean Absolute Error: 11.57
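
Mean absolute error is the average size of the prediction errors, so an MAE of 11.57 means predictions are off by roughly 11.6 popularity points on average on the 0-100 scale. The small plain-Python sketch below shows the calculation and compares it against a baseline that always predicts the mean. The numbers and the helper name `mae_by_hand` are illustrative, not results from the model above.

```python
def mae_by_hand(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up popularity scores and predictions, for illustration only
actual = [70, 50, 60, 80]
predicted = [60, 55, 60, 65]

# Baseline: predict the mean of the actual scores for every song
mean_score = sum(actual) / len(actual)
baseline = [mean_score] * len(actual)

print(f"Model MAE: {mae_by_hand(actual, predicted):.2f}")
print(f"Baseline MAE: {mae_by_hand(actual, baseline):.2f}")
```

With the real data, a comparable baseline would predict the mean of `y_train` for every row of `X_test`. If the model's MAE is not clearly below that baseline, the chosen features are adding little.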


5️⃣ Making Predictions
Now, let’s predict the popularity of a new song!

In [None]:
# Create a one-row DataFrame with the same feature columns used for training
# Values are in the order: danceability, energy, tempo (BPM), valence
new_song = pd.DataFrame([[0.8, 0.75, 120, 0.6]], columns=features)

# Make prediction
predicted_popularity = model.predict(new_song)[0]
print(f"Predicted Popularity: {predicted_popularity:.2f}")

Predicted Popularity: 34.23
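
The model will return a number for any input, including values far outside what it saw during training. Here is a hedged sketch of a check to run before predicting. It assumes that danceability, energy, and valence lie between 0 and 1 and that tempo is in beats per minute. `check_song` is a hypothetical helper, not part of pandas or scikit-learn.

```python
def check_song(danceability, energy, tempo, valence):
    """Return the values as a list in training-column order, or raise ValueError."""
    for name, value in [("danceability", danceability),
                        ("energy", energy),
                        ("valence", valence)]:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    if tempo < 0:
        raise ValueError(f"tempo must not be negative, got {tempo}")
    # Same order as the features list: danceability, energy, tempo, valence
    return [danceability, energy, tempo, valence]

row = check_song(0.8, 0.75, 120, 0.6)
print(row)
```

The checked row can then be wrapped as `pd.DataFrame([row], columns=features)` and passed to `model.predict` as in the cell above.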
