# 🎵 Spotify Data Analysis & Machine Learning Project

**Student:** Firuze Eroğlu (201613709044)  
**Course:** Machine Learning (Fall 2025)

This project uses Spotify data to work through three machine learning problems end to end:
1.  **Classification:** Predicting whether a track is popular from its audio features.
2.  **Regression:** Predicting each track's popularity score (0-100).
3.  **Clustering:** Segmenting tracks by their stream counts.

---

## 1. Importing Libraries and Settings
The required data science libraries, such as `pandas`, `sklearn`, and `xgboost`, are imported.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
import time

# Scikit-learn modules
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, classification_report, r2_score, mean_squared_error)
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingRegressor
from sklearn.cluster import KMeans, MiniBatchKMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import xgboost as xgb
import shap

# Suppress warnings (for cleaner output)
warnings.filterwarnings('ignore')

## 2. Loading the Datasets (Automatic)
This code block works the same way locally and on Google Colab. If a dataset is missing from the working directory, it is **downloaded automatically from the GitHub repository.**

In [None]:
# --- DATA DOWNLOAD FUNCTION ---
from urllib.request import urlretrieve

def load_data_from_github(filename, url):
    # 1. Check whether the file already exists locally
    if not os.path.exists(filename):
        print(f"[INFO] '{filename}' not found, downloading from GitHub...")
        try:
            # Fetch the file with urllib: portable, and it raises on failure
            # (os.system + wget only returns an exit code, so errors went unnoticed)
            urlretrieve(url, filename)
            print("[SUCCESS] Download complete.")
        except Exception as e:
            print(f"[ERROR] Download failed: {e}")
            raise
    else:
        print(f"[INFO] '{filename}' already exists, loading...")

    # 2. Read the file (retrying with another encoding on a decode error)
    try:
        df = pd.read_csv(filename)
    except UnicodeDecodeError:
        df = pd.read_csv(filename, encoding='ISO-8859-1')
    return df

# URL definitions
URL_CLASS = "https://raw.githubusercontent.com/frzerxz/spotify-ml-analysis/main/dataset.csv"
URL_REG = "https://raw.githubusercontent.com/frzerxz/spotify-ml-analysis/main/spotify_data%20clean.csv"
URL_CLUST = "https://raw.githubusercontent.com/frzerxz/spotify-ml-analysis/main/Most_Streamed_Spotify_Songs_2024.csv"
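
The encoding fallback used by the loader can be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper `read_csv_with_fallback` and a throwaway file `sample_latin1.csv`; neither is part of the project, and the sample row is made up.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, fallback_encoding="ISO-8859-1"):
    """Hypothetical helper: try UTF-8 first, then retry with a fallback encoding."""
    try:
        return pd.read_csv(path)  # pandas assumes UTF-8 by default
    except UnicodeDecodeError:
        return pd.read_csv(path, encoding=fallback_encoding)


# Write a tiny Latin-1 encoded CSV; the single byte used for "é" is not valid UTF-8 here.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_latin1.csv")
    with open(sample_path, "w", encoding="ISO-8859-1", newline="") as handle:
        handle.write("track,streams\nCafé del Mar,1200\n")
    sample_df = read_csv_with_fallback(sample_path)

print(sample_df)
```

Catching only `UnicodeDecodeError` keeps other problems, such as a missing or empty file, visible instead of hiding them behind a second read attempt.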



## 3. Classification Task
**Goal:** Predict whether a track is 'Popular' from its audio features (danceability, energy, etc.).

In [None]:
# 1. Load the data
df_class = load_data_from_github('dataset.csv', URL_CLASS)
print(f"Dataset shape: {df_class.shape}")

# 2. Create the target variable
# Tracks with popularity above 50 are labeled '1' (popular), the rest '0' (not popular)
df_class['is_popular'] = (df_class['popularity'] > 50).astype(int)

# 3. Drop unneeded columns
# Columns such as IDs and names, which do not help the model, are removed.
cols_to_drop = ['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name', 'popularity', 'track_genre']
X = df_class.drop(columns=[c for c in cols_to_drop if c in df_class.columns])
X = X.drop(columns=['is_popular']) # Separate the target from the features
y = df_class['is_popular']

# Convert the explicit column to a numeric type
if 'explicit' in X.columns:
    X['explicit'] = X['explicit'].astype(int)

print("Feature selection complete.")
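
To make the labeling rule concrete, here is a small sketch on made-up popularity scores (the `toy` frame is illustrative and not drawn from the dataset). It shows how the strict `> 50` threshold treats a score of exactly 50 and how the resulting class shares can be checked.

```python
import pandas as pd

# Made-up popularity scores, used only to illustrate the labeling rule.
toy = pd.DataFrame({"popularity": [0, 35, 50, 51, 78, 100]})

# Strictly greater than 50 counts as popular, so a score of exactly 50 is labeled 0.
toy["is_popular"] = (toy["popularity"] > 50).astype(int)

# Share of each class; a heavily skewed split signals class imbalance.
class_share = toy["is_popular"].value_counts(normalize=True).sort_index()

print(toy)
print(class_share)
```

Running the same `value_counts(normalize=True)` check on the real `is_popular` column is a quick way to see how imbalanced the classes are before any resampling.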



### Data Preprocessing and Class Imbalance Handling
Missing values are imputed, the features are scaled, and class imbalance is addressed with `RandomUnderSampler`.

In [None]:
# Train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Pipeline definition
# Numeric features: impute missing values -> standardize (scale)
# 'number' also matches int32/float32, which astype(int) can produce on some platforms
numeric_features = X.select_dtypes(include='number').columns
categorical_features = X.select_dtypes(include=['object', 'bool']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())]), numeric_features),
        ('cat', Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical_features)
    ])

# Handling imbalanced data (undersampling)
# Unpopular tracks far outnumber popular ones, so the majority class is downsampled to match.
try:
    from imblearn.under_sampling import RandomUnderSampler
    rus = RandomUnderSampler(random_state=42)
    # Balance only the training split, before the preprocessing pipeline is fitted
    print(f"Class distribution before resampling: {np.bincount(y_train)}")
    X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)
    print(f"Class distribution after resampling: {np.bincount(y_train_resampled)}")

    X_train = X_train_resampled
    y_train = y_train_resampled
except ImportError:
    print("[WARNING] imblearn not found, continuing with the original data.")

# Transform the data (fit on the training set, then transform both sets)
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
print("Preprocessing complete.")
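
The idea behind random undersampling can be sketched with NumPy alone. This is not how `imblearn` implements `RandomUnderSampler`; it is a simplified illustration, and the helper `undersample_majority` and the toy arrays are assumptions made for the example.

```python
import numpy as np


def undersample_majority(X, y, seed=42):
    """Hypothetical helper: keep as many random rows per class as the smallest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    kept = [
        rng.choice(np.flatnonzero(y == cls), size=n_keep, replace=False)
        for cls in classes
    ]
    idx = np.sort(np.concatenate(kept))
    return X[idx], y[idx]


# Toy data: 8 "not popular" rows (0) and 2 "popular" rows (1).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

X_bal, y_bal = undersample_majority(X_toy, y_toy)
print(np.bincount(y_toy), "->", np.bincount(y_bal))
```

Only the training split is balanced this way. The test split keeps its original class distribution, so the reported metrics reflect the data the model would actually see.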



### Model Training and Evaluation
A Random Forest model is trained and the results are reported.

In [None]:
# Model training: Random Forest Classifier
print("Training the Random Forest model...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_processed, y_train)

# Prediction
y_pred = rf_model.predict(X_test_processed)

# Results
print("--- Classification Results ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
# Detailed report (precision, recall, F1)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
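
Because the test split keeps its original imbalance, accuracy alone can hide how the model treats the popular class. The sketch below recomputes precision, recall, and F1 by hand for made-up labels; `binary_report` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def binary_report(y_true, y_pred, positive=1):
    """Hypothetical helper: precision, recall and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: 8 unpopular and 2 popular tracks, so the toy test set is imbalanced.
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
precision, recall, f1 = binary_report(y_true_toy, y_pred_toy)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case most predictions are correct, yet only one in four "popular" predictions is right, which is the kind of gap the per-class rows of `classification_report` expose.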



## 4. Regression Task
**Goal:** Predict the 'Popularity Score' (0-100) as an exact value from track and artist metadata.

In [None]:
# 1. Load the data
df_reg = load_data_from_github('spotify_data clean.csv', URL_REG)
print(f"Regression dataset shape: {df_reg.shape}")

# 2. Feature selection
target_col = 'track_popularity'
# Drop text-based and unneeded columns
cols_to_drop = ['track_id', 'track_name', 'artist_name', 'album_id', 'album_name', 'album_release_date', 'artist_genres']
X_reg = df_reg.drop(columns=[c for c in cols_to_drop if c in df_reg.columns])
X_reg = X_reg.drop(columns=[target_col])
y_reg = df_reg[target_col]

# 3. Pipeline preparation
# 'number' matches every numeric dtype, not only int64/float64
numeric_features_reg = X_reg.select_dtypes(include='number').columns
categorical_features_reg = X_reg.select_dtypes(include=['object', 'bool']).columns

reg_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())]), numeric_features_reg),
    ('cat', Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical_features_reg)
])

# Train/test split
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
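
To see what a preprocessor of this shape produces, the following sketch runs the same kind of `ColumnTransformer` on a made-up frame. The column names (`followers`, `duration_min`, `album_type`) and values are illustrative assumptions, not columns confirmed to exist in the regression dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; the column names are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "followers": [100.0, np.nan, 300.0, 500.0],
    "duration_min": [3.0, 4.0, np.nan, 5.0],
    "album_type": ["album", "single", "album", "compilation"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                          ("scaler", StandardScaler())]), ["followers", "duration_min"]),
        ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["album_type"]),
    ],
    sparse_threshold=0,  # always return a dense array, which is easier to inspect
)

toy_matrix = toy_preprocessor.fit_transform(toy)

# 2 scaled numeric columns + 3 one-hot columns (one per album_type category)
print(toy_matrix.shape)
print(np.round(toy_matrix, 2))
```

The missing numeric values are replaced by each column's median before scaling, and each distinct category becomes its own 0/1 column, so high-cardinality text columns can widen the matrix considerably.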



In [None]:
# Model training: Random Forest Regressor
print("Training the regression model (this may take a few seconds)...")
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

# Build the pipeline (preprocessing + model)
reg_pipeline = Pipeline(steps=[('preprocessor', reg_preprocessor), ('regressor', rf_reg)])
reg_pipeline.fit(X_train_reg, y_train_reg)

# Prediction and evaluation
y_pred_reg = reg_pipeline.predict(X_test_reg)

r2 = r2_score(y_test_reg, y_pred_reg)
rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))

print("--- Regression Results ---")
print(f"R2 score (share of variance explained): {r2:.4f}")
print(f"RMSE (typical error, in popularity points): {rmse:.2f}")
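
As a reading aid for the two metrics, this sketch computes R2 and RMSE from their definitions on made-up values (the arrays are illustrative only). R2 compares the model's squared error with that of always predicting the mean, and RMSE is expressed in the same units as the target.

```python
import numpy as np

# Made-up popularity scores and predictions, for illustration only.
y_true_toy = np.array([20.0, 40.0, 60.0, 80.0])
y_pred_toy = np.array([30.0, 40.0, 50.0, 80.0])

residuals = y_true_toy - y_pred_toy
ss_res = np.sum(residuals ** 2)                          # variation the model leaves unexplained
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)   # total variation around the mean

r2_toy = 1.0 - ss_res / ss_tot
rmse_toy = np.sqrt(np.mean(residuals ** 2))

print(f"R2 = {r2_toy:.4f}, RMSE = {rmse_toy:.2f}")
```

An R2 of 0 would mean the model does no better than predicting the mean popularity for every track, and a negative value would mean it does worse.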



## 5. Clustering Task
**Goal:** Group tracks by their stream counts and social media engagement (TikTok/YouTube).
This dataset is unlabeled, so the task is unsupervised learning.

In [None]:
# 1. Load the data
filename_clust = 'Most_Streamed_Spotify_Songs_2024.csv'
df_clust = load_data_from_github(filename_clust, URL_CLUST)

# 2. Data cleaning
# Strip thousands separators (e.g. "1,000") from columns that should be numeric.
cols_to_clean = ['Spotify Streams', 'Spotify Playlist Count', 'YouTube Views', 'TikTok Views']
for col in cols_to_clean:
    if col in df_clust.columns and df_clust[col].dtype == object:
        df_clust[col] = pd.to_numeric(df_clust[col].astype(str).str.replace(',', ''), errors='coerce')

# Keep only these numeric columns and fill missing values with 0
X_clust = df_clust[cols_to_clean].fillna(0)

# If the data is very large, keep only the first 5000 rows for speed
if len(X_clust) > 5000:
    X_clust = X_clust.iloc[:5000]

# 3. Scaling (critical for clustering)
scaler_clust = StandardScaler()
X_clust_scaled = scaler_clust.fit_transform(X_clust)

# 4. K-Means modeling
print("Running K-Means clustering (K=3)...")
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_clust_scaled)

# 5. Evaluation metric (silhouette score)
sil_score = silhouette_score(X_clust_scaled, labels)
print("--- Clustering Result ---")
print(f"Silhouette score: {sil_score:.4f} (the closer to 1, the better separated the clusters)")
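
The notebook fixes K=3. One common way to sanity-check such a choice is to compare silhouette scores across several values of K. The sketch below does this on synthetic blobs standing in for the scaled features; the data, the range of K, and `n_init=10` are assumptions made for the example, not settings taken from the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled stream/view features: three well-separated groups.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit K-Means for each candidate K and record the silhouette score.
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    scores[k] = silhouette_score(X_demo_scaled, km.fit_predict(X_demo_scaled))

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")
print(f"Best K by silhouette: {best_k}")
```

Applying the same loop to `X_clust_scaled` would show whether K=3 is actually the best-separated choice for the streaming data or whether another K scores higher.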

