# Preprocessing and Feature Engineering

This notebook provides a comprehensive preprocessing pipeline for the Spotify dataset. The goal is to transform raw data into a structured, clean, and feature-rich format suitable for machine learning models. The preprocessing steps ensure the dataset is consistent, relevant, and optimized for analysis and training.

## Importing Libraries

- **Purpose:** Import necessary libraries for data manipulation, transformation, and preprocessing.
- **Libraries Used:**
  - `pandas`: For data manipulation and preprocessing.
  - `numpy`: For numerical computations.
  - `datetime`: To handle and compute date-related features.
  - `sklearn.preprocessing`: For scaling and generating interaction terms.

In [43]:
# Importing the libraries
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
import pandas as pd
from datetime import datetime
import numpy as np

## Loading the Dataset
- **File Source:** The dataset is loaded from a CSV file (`spotify-2023-cleaned.csv`) using `pandas.read_csv`.
- **Encoding:** Explicit encoding is specified (`ISO-8859-1`) to handle potential special characters in the dataset.
- **Next Steps:** Begin cleaning and preprocessing the dataset.

In [44]:
# Importing the dataset
spotify_data = pd.read_csv(
    'dataset/spotify-2023-cleaned.csv', encoding='ISO-8859-1')

## Dropping Irrelevant Columns
- **Columns Dropped:**
  - `track_id`
  - `track_name`
  - `artist_name`
- **Reason:** These identifier and name columns are not used as model inputs in the analysis and modeling process.

In [45]:
# Drop irrelevant columns
columns_to_drop = ['track_id', 'track_name', 'artist_name']
spotify_data.drop(columns=columns_to_drop, inplace=True)

## Combining Release Date into a Single Feature
- **Steps:**
  - Convert `released_year`, `released_month`, and `released_day` into a single `release_date`.
  - Calculate `days_since_release` relative to a fixed reference date (2024-12-01), so the feature is reproducible across runs.
  - Drop the original date-related columns after computation.
- **Purpose:** Convert discrete date features into a single numerical feature that captures the recency of the release.

In [46]:
# Combine release date into a single feature: "days_since_release"
current_date = datetime(2024, 12, 1)
spotify_data['release_date'] = pd.to_datetime({
    'year': spotify_data['released_year'],
    'month': spotify_data['released_month'],
    'day': spotify_data['released_day']
})
spotify_data['days_since_release'] = (
    current_date - spotify_data['release_date']).dt.days
spotify_data.drop(columns=['release_date', 'released_year',
                  'released_month', 'released_day'], inplace=True)
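
The date steps above can be sketched on a toy frame. This is a minimal illustration, assuming the same three column names as the notebook; the row values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook reads these columns from the CSV.
toy = pd.DataFrame({
    "released_year": [2023, 2019],
    "released_month": [7, 11],
    "released_day": [14, 29],
})

# Fixed reference date, matching the one used in the cell above.
reference_date = pd.Timestamp("2024-12-01")

# pd.to_datetime can assemble dates from a frame whose columns are
# named year, month, and day, so the columns are renamed first.
release_date = pd.to_datetime(
    toy.rename(columns={
        "released_year": "year",
        "released_month": "month",
        "released_day": "day",
    })
)

# Subtracting two datetimes yields a Timedelta; .dt.days extracts whole days.
toy["days_since_release"] = (reference_date - release_date).dt.days
print(toy["days_since_release"].tolist())
```

By default `pd.to_datetime` raises on impossible dates (for example, day 30 in February); passing `errors="coerce"` turns them into `NaT` instead, which would then need separate handling.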

## Normalizing Numerical Features
- **Columns Normalized:**
  - `bpm`, `danceability`, `valence`, `energy`, `streams`, `acousticness`, `instrumentalness`, `liveness`, `speechiness`, `days_since_release`.
- **Method:** `MinMaxScaler` from `sklearn` rescales each column independently to the range 0 to 1.
- **Why Normalize:** To put the numerical features on a common scale, so that features with large raw ranges (such as `streams`) do not dominate scale-sensitive models.

In [47]:
# Normalize numerical features
numerical_features = ['bpm', 'danceability', 'valence', 'energy', 'streams', 'acousticness',
                      'instrumentalness', 'liveness', 'speechiness', 'days_since_release']
scaler = MinMaxScaler()
spotify_data[numerical_features] = scaler.fit_transform(
    spotify_data[numerical_features])
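
As a small check of what `MinMaxScaler` computes, the sketch below compares it with the per-column formula `(x - min) / (max - min)`. The values are hypothetical and only the column names are borrowed from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    "bpm": [60.0, 120.0, 180.0],
    "energy": [20.0, 50.0, 80.0],
})

# Scaler fitted and applied in one step, as in the cell above.
scaled = MinMaxScaler().fit_transform(toy)

# Same result by hand: (x - min) / (max - min), computed per column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

print(np.allclose(scaled, manual.to_numpy()))
```

If the data is later split into training and test sets, fitting the scaler on the full dataset lets test-set minima and maxima influence the training features; fitting on the training split alone is the usual way to avoid that.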

## Encoding Categorical Features
- **Features Encoded:**
  - `key` (musical key)
  - `mode` (Major or Minor mode)
- **Method:** One-hot encoding with `drop_first=True` to avoid multicollinearity.

In [48]:
# One-hot encode categorical features
categorical_features = ['key', 'mode']
spotify_data = pd.get_dummies(
    spotify_data, columns=categorical_features, drop_first=True)
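
The effect of `drop_first=True` is easiest to see on a few rows. The category labels below are assumptions made for illustration; the actual labels in the dataset may differ.

```python
import pandas as pd

# Hypothetical rows; the labels are assumptions, not dataset values.
toy = pd.DataFrame({
    "key": ["A", "C#", "A", "G"],
    "mode": ["Major", "Minor", "Minor", "Major"],
})

# Each categorical column becomes one indicator column per category,
# minus the first category (in sorted order), which acts as the baseline.
encoded = pd.get_dummies(toy, columns=["key", "mode"], drop_first=True)
print(encoded.columns.tolist())
```

Here `key_A` and `mode_Major` are the dropped baselines. Note that `pd.get_dummies` ignores missing values by default, so a row with a missing `key` would be encoded as all zeros and become indistinguishable from the baseline category.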

## Summarizing Platform Presence
- **Steps:**
  - Identify columns related to platform presence (e.g., playlist or chart-related features).
  - Create a summary feature `platform_presence_summary` by summing all platform-related columns.
  - Drop the individual platform columns after summarization.
- **Purpose:** Reduce dimensionality while retaining the essence of platform-related data.

In [49]:
# Summarize platform presence
platform_columns = [
    col for col in spotify_data.columns
    if 'playlist' in col.lower() or 'chart' in col.lower()
]
if platform_columns:
    spotify_data['platform_presence_summary'] = (
        spotify_data[platform_columns].sum(axis=1))

# Drop individual platform columns after summarization
spotify_data.drop(columns=platform_columns, inplace=True)
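
A minimal sketch of the summarization step follows. The column names and counts are hypothetical, chosen only so that the substring filter has something to match.

```python
import pandas as pd

# Hypothetical column names and counts, chosen only to exercise the filter.
toy = pd.DataFrame({
    "in_spotify_playlists": [100, 2500],
    "in_spotify_charts": [5, 0],
    "in_apple_playlists": [10, 40],
    "bpm": [120, 95],
})

# Select every column whose name mentions playlists or charts.
platform_columns = [
    c for c in toy.columns
    if "playlist" in c.lower() or "chart" in c.lower()
]

# Row-wise sum collapses the platform columns into one feature.
toy["platform_presence_summary"] = toy[platform_columns].sum(axis=1)
toy = toy.drop(columns=platform_columns)
print(toy)
```

Because the columns are summed without rescaling, whichever platform column has the largest raw values will dominate the summary. Scaling each column before summing is one possible alternative if that is not the intent.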

## Log Transformation of Streams
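- **Purpose:** Apply a `log1p` transformation to reduce skewness in the `streams` feature, making it more suitable for modeling. Note that `streams` has already been scaled to [0, 1] in the normalization step, so `log1p` here maps it into [0, ln 2] and compresses the tail far less than it would on raw counts.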
- **Steps:**
  - Transform `streams` into `log_streams`.
  - Drop the original `streams` column after transformation.

In [50]:
# Log-transform the streams feature to reduce skewness (if applicable)
if 'streams' in spotify_data.columns:
    spotify_data['log_streams'] = np.log1p(spotify_data['streams'])
    spotify_data.drop(columns=['streams'], inplace=True)
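
The order of scaling and log transformation matters for how much skewness is removed. The sketch below compares `log1p` on raw counts with `log1p` on min-max scaled values, using hypothetical stream counts rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed stream counts.
streams = pd.Series([1e6, 5e6, 2e7, 1e8, 3e9])

# log1p on raw counts compresses the long right tail.
log_raw = np.log1p(streams)

# log1p on values already scaled to [0, 1] can only map into [0, log(2)],
# where the function is close to linear, so the shape barely changes.
scaled = (streams - streams.min()) / (streams.max() - streams.min())
log_scaled = np.log1p(scaled)

# Sample skewness of each variant, for comparison.
print(streams.skew(), log_raw.skew(), log_scaled.skew())
```

On this toy series the raw-count variant ends up much less skewed than the scaled variant. Whether the same holds for the actual `streams` column would need to be checked on the data; if so, applying `log1p` before scaling is one option to consider.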

## Generating Interaction Terms
- **Selected Features:** 
  - `platform_presence_summary`, `danceability`, `energy`, `bpm`, `valence`, `artist_count`.
- **Method:**
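  - Use `PolynomialFeatures` with `degree=2`, `interaction_only=True`, and `include_bias=False` to generate pairwise products without squared terms or a constant column.
  - Create new features representing the interactions between the selected features.
- **Purpose:** Capture pairwise (multiplicative) relationships between features that a linear model cannot represent on its own, which may improve model accuracy.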

In [51]:
# Generate interaction terms for selected features
interaction_features = ['platform_presence_summary', 'danceability', 'energy', 'bpm', 'valence', 'artist_count']
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interaction_terms = poly.fit_transform(spotify_data[interaction_features])

# Create a DataFrame for the interaction terms
interaction_feature_names = poly.get_feature_names_out(interaction_features)
interaction_terms_df = pd.DataFrame(interaction_terms, columns=interaction_feature_names)

## Merging Interaction Terms
- **Steps:**
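  - Convert the generated interaction terms into a `DataFrame`.
  - Concatenate these new features column-wise with the original dataset.
  - Drop duplicate column names: the `PolynomialFeatures` output also contains the six original features, so only the first occurrence of each is kept.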

In [52]:
# Merge interaction terms with the original data
spotify_data = pd.concat([spotify_data, interaction_terms_df], axis=1)

# Ensure no duplicate column names
spotify_data = spotify_data.loc[:, ~spotify_data.columns.duplicated()]
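
The merge and de-duplication can be sketched on two small frames. The values are hypothetical; the second part illustrates an assumption the merge relies on, namely that both frames share the same row index.

```python
import pandas as pd

# Hypothetical base data and interaction output (which repeats the inputs).
base = pd.DataFrame({"danceability": [0.5, 0.8], "energy": [0.2, 0.4]})
extra = pd.DataFrame({
    "danceability": [0.5, 0.8],
    "energy": [0.2, 0.4],
    "danceability energy": [0.1, 0.32],
})

# Column-wise concatenation keeps both copies of the repeated columns.
merged = pd.concat([base, extra], axis=1)

# columns.duplicated() flags later repeats, so negating it keeps the first copy.
deduped = merged.loc[:, ~merged.columns.duplicated()]
print(deduped.columns.tolist())

# pd.concat aligns on the index: if the indexes differ, rows do not line up
# and missing values are introduced instead of an error being raised.
shifted = extra.copy()
shifted.index = [1, 2]
misaligned = pd.concat([base, shifted], axis=1)
print(len(misaligned), int(misaligned.isna().any(axis=1).sum()))
```

In the notebook both frames carry a default `RangeIndex`, so the rows align. If rows were ever filtered out earlier without resetting the index, passing `index=spotify_data.index` when building the interaction `DataFrame` would be one way to keep them aligned.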

## Saving the Preprocessed Dataset
- **File Saved To:** `spotify-2023-preprocessed.csv`
- **Reason:** Store the preprocessed dataset for model building and analysis.

In [53]:
# Save the preprocessed dataset
spotify_data.to_csv('dataset/spotify-2023-preprocessed.csv', index=False)

print('Preprocessing completed successfully!')

Preprocessing completed successfully!


## Dataset Summary and Inspection
- **Method:** Use `info()` to display:
  - Total number of entries.
  - Data types of columns.
  - Non-null counts, to confirm there are no missing values.
- **Result:** The dataset contains 945 entries and 39 columns, including the new interaction terms and encoded features.

In [54]:
spotify_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 945 entries, 0 to 944
Data columns (total 39 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   artist_count                            945 non-null    int64  
 1   bpm                                     945 non-null    float64
 2   danceability                            945 non-null    float64
 3   valence                                 945 non-null    float64
 4   energy                                  945 non-null    float64
 5   acousticness                            945 non-null    float64
 6   instrumentalness                        945 non-null    float64
 7   liveness                                945 non-null    float64
 8   speechiness                             945 non-null    float64
 9   days_since_release                      945 non-null    float64
 10  key_A#                                  945 non-null    bool  

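## Alternative Pipeline with Standardization
- **Change:** The cell below reruns the full pipeline in a single cell, replacing `MinMaxScaler` with `StandardScaler` (zero mean, unit variance per feature).
- **Output:** The result is saved separately as `spotify-2023-preprocessed-v2.csv`.

In [55]: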
# Importing the libraries
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
import pandas as pd
from datetime import datetime
import numpy as np

# Importing the dataset
spotify_data = pd.read_csv(
    'dataset/spotify-2023-cleaned.csv', encoding='ISO-8859-1')

# Drop irrelevant columns
columns_to_drop = ['track_id', 'track_name', 'artist_name']
spotify_data.drop(columns=columns_to_drop, inplace=True)

# Combine release date into a single feature: "days_since_release"
current_date = datetime(2024, 12, 1)
spotify_data['release_date'] = pd.to_datetime({
    'year': spotify_data['released_year'],
    'month': spotify_data['released_month'],
    'day': spotify_data['released_day']
})
spotify_data['days_since_release'] = (
    current_date - spotify_data['release_date']).dt.days
spotify_data.drop(columns=['release_date', 'released_year',
                  'released_month', 'released_day'], inplace=True)

# Standardize numerical features
numerical_features = ['bpm', 'danceability', 'valence', 'energy', 'streams', 'acousticness',
                      'instrumentalness', 'liveness', 'speechiness', 'days_since_release']
scaler = StandardScaler()
spotify_data[numerical_features] = scaler.fit_transform(
    spotify_data[numerical_features])

# One-hot encode categorical features
categorical_features = ['key', 'mode']
spotify_data = pd.get_dummies(
    spotify_data, columns=categorical_features, drop_first=True)

# Summarize platform presence
platform_columns = [
    col for col in spotify_data.columns
    if 'playlist' in col.lower() or 'chart' in col.lower()
]
if platform_columns:
    spotify_data['platform_presence_summary'] = (
        spotify_data[platform_columns].sum(axis=1))

# Drop individual platform columns after summarization
spotify_data.drop(columns=platform_columns, inplace=True)

# Log-transform the streams feature to reduce skewness (if applicable)
if 'streams' in spotify_data.columns:
    spotify_data['log_streams'] = np.log1p(spotify_data['streams'])
    spotify_data.drop(columns=['streams'], inplace=True)

# Generate interaction terms for selected features
interaction_features = ['platform_presence_summary', 'danceability', 'energy', 'bpm', 'valence', 'artist_count']
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interaction_terms = poly.fit_transform(spotify_data[interaction_features])

# Create a DataFrame for the interaction terms
interaction_feature_names = poly.get_feature_names_out(interaction_features)
interaction_terms_df = pd.DataFrame(interaction_terms, columns=interaction_feature_names)

# Merge interaction terms with the original data
spotify_data = pd.concat([spotify_data, interaction_terms_df], axis=1)

# Ensure no duplicate column names
spotify_data = spotify_data.loc[:, ~spotify_data.columns.duplicated()]
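
One point to watch in the `StandardScaler` variant is that `streams` is standardized before `log1p` is applied, and standardized values can be negative. The sketch below uses hypothetical values (not the dataset) to show what happens when a z-score falls below -1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical, symmetric values chosen so that one z-score falls below -1.
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
z = StandardScaler().fit_transform(values).ravel()

# log1p(x) is undefined for x < -1, so NumPy returns NaN there.
# errstate only silences the accompanying RuntimeWarning.
with np.errstate(invalid="ignore", divide="ignore"):
    logged = np.log1p(z)

print(z.round(3))
print(int(np.isnan(logged).sum()))
```

Whether the real `streams` column produces z-scores below -1 is not shown in the notebook output, so this may not occur in practice. Checking `spotify_data['log_streams'].isna().sum()` after the cell runs would confirm it either way.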

In [56]:
# Save the preprocessed dataset
spotify_data.to_csv('dataset/spotify-2023-preprocessed-v2.csv', index=False)

print('Preprocessing completed successfully!')

Preprocessing completed successfully!


In [57]:
spotify_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 945 entries, 0 to 944
Data columns (total 39 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   artist_count                            945 non-null    int64  
 1   bpm                                     945 non-null    float64
 2   danceability                            945 non-null    float64
 3   valence                                 945 non-null    float64
 4   energy                                  945 non-null    float64
 5   acousticness                            945 non-null    float64
 6   instrumentalness                        945 non-null    float64
 7   liveness                                945 non-null    float64
 8   speechiness                             945 non-null    float64
 9   days_since_release                      945 non-null    float64
 10  key_A#                                  945 non-null    bool  