# Google Review Ratings - Clustering Analysis

## 📊 Project Summary

This notebook performs a comprehensive **clustering analysis** on the **Google Travel Review Ratings** data. The analysis examines how users rate different travel categories in order to group users with similar rating behavior.

**Data Source:** [UCI Machine Learning Repository - Travel Review Ratings](https://archive.ics.uci.edu/dataset/485/tarvel+review+ratings)

---

## 📋 About the Dataset

### General Information
- **Total Number of Users:** 5,456
- **Number of Categories:** 24 travel categories
- **Rating Scale:** 0-5 (0 = no rating, 1-5 = rating)
- **Data Type:** User-level rating matrix

### Analysis Methods
The following methods are applied in this notebook:
1. **K-Means Clustering**
2. **Hierarchical Clustering (Agglomerative)**
3. **DBSCAN**
4. **Dimensionality Reduction and Visualization with PCA**
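
As a rough orientation, the four methods listed above can be sketched on synthetic data as follows. The blob data, the choice of three clusters, and the DBSCAN settings `eps=0.5` and `min_samples=5` are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the rating matrix (assumption: 300 users, 6 categories)
X_demo, _ = make_blobs(n_samples=300, n_features=6, centers=3, random_state=42)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit each clustering algorithm on the same standardized data
demo_labels = {
    "kmeans": KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_demo_scaled),
    "agglomerative": AgglomerativeClustering(n_clusters=3).fit_predict(X_demo_scaled),
    "dbscan": DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo_scaled),
}

# PCA is used here only to obtain 2-D coordinates for plotting
X_demo_2d = PCA(n_components=2).fit_transform(X_demo_scaled)

for name, labels in demo_labels.items():
    n_found = len(set(labels) - {-1})  # DBSCAN marks noise points as -1
    print(f"{name}: {n_found} clusters")
```

Unlike K-Means and agglomerative clustering, DBSCAN does not take the number of clusters as input, so the count it reports depends on `eps` and `min_samples`.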


## 📦 1. Importing Libraries


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("✅ Libraries loaded successfully!")


✅ Libraries loaded successfully!


## 📥 2. Loading and Inspecting the Data


In [None]:
# Load the data
df = pd.read_csv('data/google_review_ratings.csv')

print("📊 Dataset Information:")
print("=" * 50)
print(f"Number of Rows: {df.shape[0]}")
print(f"Number of Columns: {df.shape[1]}")
print("\nFirst 5 Rows:")
print(df.head())
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum().sum())
print("\nSummary Statistics:")
print(df.describe())


📊 Dataset Information:
Number of Rows: 5456
Number of Columns: 26

First 5 Rows:
     User  Category 1  Category 2  Category 3  Category 4  Category 5  \
0  User 1         0.0         0.0        3.63        3.65         5.0   
1  User 2         0.0         0.0        3.63        3.65         5.0   
2  User 3         0.0         0.0        3.63        3.63         5.0   
3  User 4         0.0         0.5        3.63        3.63         5.0   
4  User 5         0.0         0.0        3.63        3.63         5.0   

   Category 6  Category 7  Category 8  Category 9  ...  Category 16  \
0        2.92         5.0        2.35        2.33  ...         0.59   
1        2.92         5.0        2.64        2.33  ...         0.59   
2        2.92         5.0        2.64        2.33  ...         0.59   
3        2.92         5.0        2.35        2.33  ...         0.59   
4        2.92         5.0        2.64        2.33  ...         0.59   

  Category 17  Category 18  Category 19  Category 

## 🔧 3. Data Preprocessing


In [None]:
# Set the User column as the index
df.set_index('User', inplace=True)

# Select the category columns
category_columns = [col for col in df.columns if 'Category' in col]
print(f"Total number of categories: {len(category_columns)}")

# Check for missing values
print(f"\nNumber of missing values: {df[category_columns].isnull().sum().sum()}")

# Optionally mark 0 values (no rating given) as NaN
# df[category_columns] = df[category_columns].replace(0, np.nan)

# Prepare the feature matrix
X = df[category_columns].copy()

# Fill missing values with the column mean
# (the mean strategy requires every column to be numeric)
imputer = SimpleImputer(strategy='mean')
X_imputed = pd.DataFrame(
    imputer.fit_transform(X),
    columns=X.columns,
    index=X.index
)

print("\n✅ Data preprocessing complete!")
print(f"Data shape: {X_imputed.shape}")


Total number of categories: 24

Number of missing values: 2


ValueError: Cannot use mean strategy with non-numeric data:
could not convert string to float: '2\t2.'
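
The `ValueError` above indicates that at least one category column holds text (here the value `'2\t2.'`), so `SimpleImputer(strategy='mean')` cannot process it. One possible fix, sketched below on a small hypothetical frame, is to coerce every column with `pd.to_numeric(errors='coerce')` so that unparsable cells become `NaN` and are then imputed like any other missing value. The helper name `coerce_and_impute` and the sample values are assumptions made for illustration.

```python
import pandas as pd
from sklearn.impute import SimpleImputer


def coerce_and_impute(frame: pd.DataFrame) -> pd.DataFrame:
    """Convert all columns to numeric, then fill NaN with column means."""
    # Cells that cannot be parsed as numbers become NaN
    numeric = frame.apply(pd.to_numeric, errors="coerce")
    imputer = SimpleImputer(strategy="mean")
    return pd.DataFrame(
        imputer.fit_transform(numeric),
        columns=numeric.columns,
        index=numeric.index,
    )


# Hypothetical miniature of the rating matrix with one malformed cell
demo = pd.DataFrame(
    {
        "Category 1": [1.0, 2.0, 3.0],
        "Category 2": ["4.0", "2\t2.", "2.0"],
    },
    index=["User 1", "User 2", "User 3"],
)
demo_imputed = coerce_and_impute(demo)
print(demo_imputed)
```

Under these assumptions, the notebook's `X_imputed` could be built with `coerce_and_impute(df[category_columns])`. Coercion silently discards the malformed value, so it is worth checking how many cells were affected before relying on the result.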

## 📈 4. Data Visualization and Exploratory Analysis


In [None]:
# Mean rating per category
category_means = X_imputed.mean().sort_values(ascending=False)

plt.figure(figsize=(14, 8))
category_means.plot(kind='barh', color='steelblue')
plt.title('Mean Rating by Category', fontsize=16, fontweight='bold')
plt.xlabel('Mean Rating', fontsize=12)
plt.ylabel('Category', fontsize=12)
plt.tight_layout()
plt.show()

print("Mean Rating by Category:")
print(category_means.round(2))


In [None]:
# Correlation matrix
plt.figure(figsize=(16, 12))
correlation_matrix = X_imputed.corr()
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix Between Categories', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
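
With 24 categories the heatmap contains 276 distinct pairs, which is hard to read by eye. A minimal sketch for listing the strongest pairs numerically is shown below. The helper `top_correlated_pairs` and the synthetic columns are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(corr: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by magnitude so strong negative correlations are ranked too
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "Category 1": base,
        "Category 2": base + rng.normal(scale=0.1, size=200),  # nearly a copy
        "Category 3": rng.normal(size=200),  # independent of the others
    }
)
top_pairs = top_correlated_pairs(demo.corr(), n=2)
print(top_pairs)
```

Applied to the notebook's data, the call would be `top_correlated_pairs(correlation_matrix)`.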


## 🎯 5. Data Standardization


In [None]:
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
X_scaled_df = pd.DataFrame(X_scaled, columns=X_imputed.columns, index=X_imputed.index)

print("✅ Data standardization complete!")
print(f"Standardized data shape: {X_scaled_df.shape}")
print("\nStandardized data statistics:")
print(X_scaled_df.describe().round(2))
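
`StandardScaler` rescales each column to zero mean and unit variance, that is $z = (x - \mu) / \sigma$, where $\sigma$ is the population standard deviation (`ddof=0`). The short check below reproduces the transformation by hand on a few hypothetical ratings. The values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ratings for 4 users across 2 categories
ratings = np.array([[0.0, 5.0], [1.0, 3.0], [2.0, 4.0], [3.0, 4.0]])

scaled = StandardScaler().fit_transform(ratings)

# Same transformation written out: z = (x - mean) / std, with ddof=0
manual = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Standardizing matters for distance-based methods such as K-Means because a column with a wider spread would otherwise dominate the Euclidean distance.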


## 🔍 6. Determining the Optimal Number of Clusters (Elbow Method & Silhouette Score)


In [None]:
# Find the optimal number of clusters with the Elbow Method and the Silhouette Score
inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))

# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Elbow Method
ax1.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
ax1.set_xlabel('Number of Clusters (k)', fontsize=12)
ax1.set_ylabel('Inertia (WCSS)', fontsize=12)
ax1.set_title('Elbow Method - Optimal Number of Clusters', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Silhouette Score
ax2.plot(K_range, silhouette_scores, 'ro-', linewidth=2, markersize=8)
ax2.set_xlabel('Number of Clusters (k)', fontsize=12)
ax2.set_ylabel('Silhouette Score', fontsize=12)
ax2.set_title('Silhouette Score - Optimal Number of Clusters', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find the best k
best_k = K_range[np.argmax(silhouette_scores)]
print(f"✅ Highest Silhouette Score: {max(silhouette_scores):.4f} (k={best_k})")
print("\nAll Silhouette Scores:")
for k, score in zip(K_range, silhouette_scores):
    print(f"  k={k}: {score:.4f}")
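
To see how the silhouette-based selection behaves when the answer is known, the sketch below runs the same loop on synthetic data with three well-separated groups. The blob centers and the range of `k` are assumptions chosen for the illustration. Real rating data rarely separates this cleanly, so the maximum there is best read together with the elbow plot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: three well-separated synthetic groups, so the expected answer is k=3
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

demo_scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    demo_scores[k] = silhouette_score(X_demo, labels)

# Pick the k with the highest mean silhouette coefficient
demo_best_k = max(demo_scores, key=demo_scores.get)
print(f"best k by silhouette: {demo_best_k}")
```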


## 🎨 7. K-Means Clustering


In [None]:
# K-Means with the optimal number of clusters
optimal_k = best_k
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Add the results to a DataFrame
results_df = X_imputed.copy()
results_df['KMeans_Cluster'] = kmeans_labels

print(f"✅ K-Means clustering complete! (k={optimal_k})")
print("\nCluster Distribution:")
print(results_df['KMeans_Cluster'].value_counts().sort_index())

# Compute the metrics
silhouette_kmeans = silhouette_score(X_scaled, kmeans_labels)
davies_bouldin_kmeans = davies_bouldin_score(X_scaled, kmeans_labels)
calinski_harabasz_kmeans = calinski_harabasz_score(X_scaled, kmeans_labels)

print("\n📊 K-Means Metrics:")
print(f"  Silhouette Score: {silhouette_kmeans:.4f}")
print(f"  Davies-Bouldin Index: {davies_bouldin_kmeans:.4f}")
print(f"  Calinski-Harabasz Score: {calinski_harabasz_kmeans:.4f}")
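
The three metrics point in different directions: a higher Silhouette Score and a higher Calinski-Harabasz Score indicate better-separated clusters, while a lower Davies-Bouldin Index is better. The sketch below illustrates this by comparing a K-Means result with randomly assigned labels on synthetic data. The data and the helper `summarize` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Assumption: three well-separated synthetic groups
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=0,
)

good_labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X_demo)
# Baseline: labels assigned at random, ignoring the data
random_labels = np.random.default_rng(0).integers(0, 3, size=len(X_demo))


def summarize(labels):
    """Compute the three internal validation metrics for one labeling."""
    return {
        "silhouette": silhouette_score(X_demo, labels),
        "davies_bouldin": davies_bouldin_score(X_demo, labels),
        "calinski_harabasz": calinski_harabasz_score(X_demo, labels),
    }


good, baseline = summarize(good_labels), summarize(random_labels)
for name in good:
    print(f"{name}: kmeans={good[name]:.3f} random={baseline[name]:.3f}")
```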


In [None]:
# Analyze the characteristics of the clusters
cluster_means = results_df.groupby('KMeans_Cluster')[category_columns].mean()

# Pass an explicit Axes so that pandas does not open a second, empty figure
fig, ax = plt.subplots(figsize=(16, 8))
cluster_means.T.plot(kind='bar', ax=ax)
plt.title('Category Means by Cluster (K-Means)', fontsize=16, fontweight='bold')
plt.xlabel('Category', fontsize=12)
plt.ylabel('Mean Rating', fontsize=12)
plt.legend(title='Cluster', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("Category Means by Cluster:")
print(cluster_means.round(2))
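
Raw cluster means can be hard to compare when some categories are rated highly by everyone. One way to make cluster profiles easier to read is to subtract the overall mean of each category, so that only the deviations remain. The sketch below shows this on a tiny hypothetical frame. The ratings and cluster labels are made up.

```python
import pandas as pd

# Hypothetical clustered ratings (cluster labels assumed to be already assigned)
demo = pd.DataFrame(
    {
        "Category 1": [4.5, 4.0, 1.0, 1.5],
        "Category 2": [1.0, 1.5, 4.0, 4.5],
        "Category 3": [3.0, 3.0, 3.0, 3.0],
        "KMeans_Cluster": [0, 0, 1, 1],
    }
)
features = ["Category 1", "Category 2", "Category 3"]

cluster_means_demo = demo.groupby("KMeans_Cluster")[features].mean()
# Positive values: the cluster rates this category above the overall average
deviation = cluster_means_demo - demo[features].mean()
# Category in which each cluster deviates most strongly upward
signature = deviation.idxmax(axis=1)

print(deviation.round(2))
print(signature)
```

In the notebook the same idea would apply to `cluster_means` and `results_df[category_columns].mean()`.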
