<a href="https://www.kaggle.com/code/syedfarazhussaini/playground-series-s5e9?scriptVersionId=260775068" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Step-by-Step Guide: Predicting the Beats-per-Minute of Songs
Welcome to the Kaggle Playground Series S5E9! This notebook will guide you through the process of building a model to predict the beats-per-minute (BPM) of songs.

## Steps
1. **Understand the Problem**  
   - Review the competition goal and data format.
2. **Import Libraries & Load Data**  
   - Import necessary Python libraries.
   - Load the training and test datasets.
3. **Explore the Data (EDA)**  
   - Inspect the data structure, check for missing values, and visualize distributions.
4. **Preprocess the Data**  
   - Handle missing values, encode categorical variables, and scale features if needed.
5. **Build a Baseline Model**  
   - Train a simple regression model (e.g., Linear Regression, Random Forest, or XGBoost).
6. **Evaluate the Model**  
   - Use cross-validation or a validation split to assess performance.
7. **Feature Engineering & Model Improvement**  
   - Try new features, different models, or hyperparameter tuning to improve results.
8. **Make Predictions on Test Data**  
   - Generate predictions for the test set.
9. **Prepare Submission**  
   - Format predictions according to `sample_submission.csv` and save for submission.
10. **Submit to Kaggle**  
    - Upload your submission and review your score.

Let's get started!

## 1. Import Libraries & Load Data
Let's import the necessary libraries and load the training, test, and sample submission datasets.

In [None]:
# Import libraries
import numpy as np
import pandas as pd

# Load datasets
filePath = "/kaggle/input/playground-series-s5e9"
train = pd.read_csv(f'{filePath}/train.csv')
test = pd.read_csv(f'{filePath}/test.csv')
sample_submission = pd.read_csv(f'{filePath}/sample_submission.csv')

# Show the shape of the datasets
print('Train shape:', train.shape)
print('Test shape:', test.shape)
print('Sample submission shape:', sample_submission.shape)

# Display the first few rows of the training data
train.head()

## 2. Explore the Data (EDA)
Let's start exploring the data. We'll begin by checking the structure, types, and summary statistics of the training set.

### 2.1. Data Structure and Types
Let's check the columns, data types, and missing values in the training data.

In [None]:
# Show columns and data types
print(train.dtypes)

# Check for missing values
print('\nMissing values per column:')
print(train.isnull().sum())
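
When a frame has many columns, the raw `isnull().sum()` listing can be hard to scan. The sketch below shows one way to condense it into a small report of only the affected columns. The helper name `missing_report` and the toy column names are hypothetical and are not taken from the competition data.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values, affected columns only."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(df),
    })
    # Keep only columns that actually have gaps, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for `train` (hypothetical columns and values)
toy = pd.DataFrame({
    "feature_a": [0.5, np.nan, 0.7, 0.9],
    "feature_b": [0.1, 0.2, 0.3, 0.4],
    "feature_c": [np.nan, np.nan, 0.3, 0.8],
})
report = missing_report(toy)
print(report)
```

In the notebook this would be called as `missing_report(train)`; an empty result means there is nothing to impute.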

### 2.2. Summary Statistics
Now, let's look at summary statistics for the numeric features in the training data.

In [None]:
# Summary statistics for numeric columns
train.describe()

### 2.3. Target Variable Distribution
Let's visualize the distribution of the target variable (BPM) to understand its range and shape.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distribution of the target variable (assuming 'BeatsPerMinute' is the target column)
plt.figure(figsize=(8, 4))
sns.histplot(train['BeatsPerMinute'], kde=True, bins=30)
plt.title('Distribution of BPM (Target Variable)')
plt.xlabel('BPM')
plt.ylabel('Frequency')
plt.show()

### 2.4. Feature Distributions and Relationships
Let's visualize the distributions of some features and their relationships with the target variable (BPM).

In [None]:
# Plot distributions for a few numeric features and their relationship with BPM
numeric_features = train.select_dtypes(include=[np.number]).columns.tolist()
numeric_features = [f for f in numeric_features if f != 'BeatsPerMinute']  # Exclude target
sample_features = numeric_features[:3]  # Plot first 3 features as example

fig, axes = plt.subplots(len(sample_features), 2, figsize=(12, 4 * len(sample_features)))
for i, feature in enumerate(sample_features):
    # Distribution
    sns.histplot(train[feature], kde=True, ax=axes[i, 0])
    axes[i, 0].set_title(f'Distribution of {feature}')
    # Relationship with BeatsPerMinute
    sns.scatterplot(x=train[feature], y=train['BeatsPerMinute'], ax=axes[i, 1], alpha=0.3)
    axes[i, 1].set_title(f'{feature} vs BeatsPerMinute')
plt.tight_layout()
plt.show()
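
Scatter plots cover only a few features at a time. As a complement, a correlation ranking gives a quick numeric view of how strongly each feature moves with the target. The following is a minimal sketch on synthetic data; the feature names and the generated values are assumptions for illustration, and only the target name `BeatsPerMinute` comes from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `train` (hypothetical features and values)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
# Make the target depend on feature_a only, plus noise
toy["BeatsPerMinute"] = 120 + 10 * toy["feature_a"] + rng.normal(scale=1.0, size=n)

# Pearson correlation of every numeric column with the target
corr = toy.corr(numeric_only=True)["BeatsPerMinute"].drop("BeatsPerMinute")
# Rank by absolute strength so strong negative correlations are not buried
corr = corr.sort_values(key=np.abs, ascending=False)
print(corr)
```

Pearson correlation only captures linear relationships, so a value near zero does not rule out a feature being useful to a tree-based model.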

## 3. Preprocess the Data
Let's prepare the data for modeling. We'll start by handling missing values, then encode categorical variables, and finally scale features if needed.

### 3.1. Handle Missing Values
First, let's handle any missing values. We fill missing numeric values with the training median and missing categorical values with the training mode, and we apply the same training-set statistics to the test set.

In [None]:
# Handle missing values (numeric: training median; categorical: training mode)
for col in train.columns:
    if train[col].isnull().sum() > 0:
        if train[col].dtype == 'object':
            fill_value = train[col].mode()[0]
        else:
            fill_value = train[col].median()
        # Assign back rather than calling fillna(inplace=True) on a column selection,
        # which recent pandas versions deprecate and which may not update the frame
        train[col] = train[col].fillna(fill_value)
        # The target column does not exist in the test set, so guard the lookup
        if col in test.columns:
            test[col] = test[col].fillna(fill_value)
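
The same idea (learn the fill value from the training data, then reuse it on the test data) can also be expressed with scikit-learn's `SimpleImputer`. This is a sketch of an alternative on toy data, not the approach the notebook itself uses; the column name and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for `train` and `test` (hypothetical values)
toy_train = pd.DataFrame({"feature_a": [1.0, 2.0, np.nan, 10.0]})
toy_test = pd.DataFrame({"feature_a": [np.nan, 4.0]})

imputer = SimpleImputer(strategy="median")

# fit_transform learns the median from the training rows only (here 2.0)
train_filled = pd.DataFrame(
    imputer.fit_transform(toy_train), columns=toy_train.columns
)
# transform reuses that training median for the test rows
test_filled = pd.DataFrame(
    imputer.transform(toy_test), columns=toy_test.columns
)

print(train_filled)
print(test_filled)
```

Because the imputer is a scikit-learn transformer, it can later be placed inside a `Pipeline` so that the fill values are recomputed within each cross-validation fold.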

### 3.2. Encode Categorical Variables
Next, we'll convert any categorical features into numeric format using one-hot encoding.

In [None]:
# One-hot encode categorical variables
categorical_cols = train.select_dtypes(include=['object']).columns.tolist()
train_encoded = pd.get_dummies(train, columns=categorical_cols)
test_encoded = pd.get_dummies(test, columns=categorical_cols)

# Align train and test dataframes to have the same columns
train_encoded, test_encoded = train_encoded.align(test_encoded, join='left', axis=1, fill_value=0)
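
To see what the `align` call does, here is a small sketch on toy frames. The `genre` column and its values are hypothetical and chosen only so that train and test contain different categories.

```python
import pandas as pd

# Toy frames: the test set has a category ("metal") never seen in training,
# and lacks two categories ("jazz", "pop") that training has
toy_train = pd.DataFrame({"genre": ["rock", "jazz", "pop"], "target": [120, 90, 110]})
toy_test = pd.DataFrame({"genre": ["rock", "metal"]})

train_enc = pd.get_dummies(toy_train, columns=["genre"])
test_enc = pd.get_dummies(toy_test, columns=["genre"])

# join='left' keeps exactly the training columns:
# columns missing from test are added and filled with 0,
# columns that exist only in test are dropped
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

print(list(train_enc.columns))
print(test_enc)
```

Note that the left join also adds the target column to the aligned test frame, filled with zeros, so the test features should be selected by an explicit column list before predicting.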

### 3.3. Feature Scaling (Optional)
For some models, scaling features can improve performance. We'll use StandardScaler as an example.

In [None]:
from sklearn.preprocessing import StandardScaler

# Identify feature columns: everything except the target (note that an ID column, if present, is kept)
target_col = 'BeatsPerMinute'  # Update if your target column is named differently
feature_cols = [col for col in train_encoded.columns if col != target_col]

scaler = StandardScaler()
train_encoded[feature_cols] = scaler.fit_transform(train_encoded[feature_cols])
test_encoded[feature_cols] = scaler.transform(test_encoded[feature_cols])
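
One caveat with the cell above: the scaler is fitted on the whole training set, so any later cross-validation sees scaling statistics computed partly from its own validation folds. A common way around this is to wrap the scaler and the model in a pipeline, so that scaling is refitted inside each fold. The sketch below illustrates this on synthetic data; the `Ridge` model and all values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (hypothetical values)
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50, scale=10, size=(300, 4))
y_toy = 120 + 0.5 * X_toy[:, 0] + rng.normal(scale=2.0, size=300)

# The scaler is fitted on the training part of each fold only
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

scores = cross_val_score(
    model, X_toy, y_toy, cv=3, scoring="neg_root_mean_squared_error"
)
rmse = -scores.mean()
print(f"Pipeline CV RMSE: {rmse:.3f}")
```

Tree-based models such as Random Forest and LightGBM are insensitive to monotonic feature scaling, so for the models used later in this notebook the difference is likely small.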

## 4. Build a Baseline Model
Let's train a simple regression model as a baseline: a Random Forest Regressor evaluated with cross-validation. The Random Forest code in the cell below is left commented out; uncomment it to run the baseline.

In [16]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

In [22]:
# Split features and target
X = train_encoded[feature_cols]
y = train_encoded[target_col]

In [17]:


# # Faster Random Forest for testing
# rf = RandomForestRegressor(n_estimators=10, random_state=42)

# # Faster cross-validation
# scores = cross_val_score(rf, X, y, cv=3, scoring='neg_root_mean_squared_error')
# print('Cross-validated RMSE (fast version):', -scores.mean())

# # Fit the model on the full training data
# rf.fit(X, y)

# # Predict on the test set
# test_preds = rf.predict(test_encoded[feature_cols])

# # Prepare the submission DataFrame (make sure the column name matches sample_submission)
# submission = sample_submission.copy()
# submission['BeatsPerMinute'] = test_preds  # Update column name if needed

# # Save to CSV for Kaggle submission
# submission.to_csv('submission.csv', index=False)
# print("Submission file 'submission.csv' created!")
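
An RMSE value is easier to interpret next to a trivial reference. The sketch below scores a `DummyRegressor` that always predicts the training mean, using the same scoring setup as the commented-out baseline. The data is synthetic, and the target's centre and spread are assumptions rather than statistics of the competition data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and y (hypothetical; features carry no signal here)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = rng.normal(loc=119, scale=20, size=300)

# Always predicts the mean of the training fold
dummy = DummyRegressor(strategy="mean")
scores = cross_val_score(
    dummy, X_toy, y_toy, cv=3, scoring="neg_root_mean_squared_error"
)
dummy_rmse = -scores.mean()

print(f"Mean-prediction RMSE: {dummy_rmse:.3f}")
print(f"Target standard deviation: {y_toy.std():.3f}")
```

The mean-prediction RMSE is close to the standard deviation of the target, so any real model should score clearly below that to be considered useful.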

## 5. Build a LightGBM Model
Let's train a LightGBM model to see whether it improves on the Random Forest Regressor.

In [19]:
# !pip install lightgbm

In [23]:
import lightgbm as lgb

# Create the LightGBM regressor
lgbm = lgb.LGBMRegressor(n_estimators=100, random_state=42)

# Cross-validation (same as before, 3 folds, negative RMSE)
lgbm_scores = cross_val_score(lgbm, X, y, cv=3, scoring='neg_root_mean_squared_error')
print('LightGBM Cross-validated RMSE:', -lgbm_scores.mean())

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005548 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2550
[LightGBM] [Info] Number of data points in the train set: 349442, number of used features: 10
[LightGBM] [Info] Start training from score 119.053707
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005590 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2550
[LightGBM] [Info] Number of data points in the train set: 349443, number of used features: 10
[LightGBM] [Info] Start training from score 118.999012
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005750 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not

In [24]:
# Fit LightGBM on the full training data
lgbm.fit(X, y)

# Predict on the test set
lgbm_test_preds = lgbm.predict(test_encoded[feature_cols])

# Prepare the submission DataFrame (make sure the column name matches sample_submission)
lgbm_submission = sample_submission.copy()
lgbm_submission['BeatsPerMinute'] = lgbm_test_preds  # Update column name if needed

# Save to CSV for Kaggle submission
lgbm_submission.to_csv('submission.csv', index=False)
print("Submission file from lgbm model 'submission.csv' created!")

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.030257 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2550
[LightGBM] [Info] Number of data points in the train set: 524164, number of used features: 10
[LightGBM] [Info] Start training from score 119.034899
Submission file from lgbm model 'submission.csv' created!
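
Before uploading, it can help to compare the submission against `sample_submission` programmatically. The following is a minimal sketch of such a check; the helper name `submission_problems` and the `id` column name are assumptions, so adjust them to match the actual sample file.

```python
import numpy as np
import pandas as pd


def submission_problems(submission, sample, id_col="id"):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("column names or order differ from the sample")
    if len(submission) != len(sample):
        problems.append("row count differs from the sample")
    elif (
        id_col in submission.columns
        and id_col in sample.columns
        and not np.array_equal(submission[id_col].to_numpy(), sample[id_col].to_numpy())
    ):
        problems.append("IDs differ from the sample")
    if submission.isnull().any().any():
        problems.append("submission contains missing values")
    return problems


# Toy frames standing in for sample_submission and a finished submission
sample = pd.DataFrame({"id": [10, 11, 12], "BeatsPerMinute": [0.0, 0.0, 0.0]})
good = sample.copy()
good["BeatsPerMinute"] = [118.2, 121.5, 119.9]
bad = good.copy()
bad.loc[1, "BeatsPerMinute"] = np.nan

print(submission_problems(good, sample))
print(submission_problems(bad, sample))
```

In the notebook this would be called as `submission_problems(lgbm_submission, sample_submission)` just before `to_csv`.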
